Abstract

Distributed averaging describes a class of network algorithms for the decentralized computation of 
aggregate statistics. Initially, each node has a scalar data value, and the goal is to compute the average 
of these values at every node (the so-called average consensus problem). Nodes iteratively exchange 
information with their neighbors and perform local updates until the value at every node converges to the 
initial network average. Much previous work has focused on algorithms where each node maintains and
updates a single value; every time an update is performed, the previous value is forgotten. Convergence
to the average consensus is achieved asymptotically. The convergence rate is fundamentally limited by
network connectivity, and it can be prohibitively slow on topologies such as grids and random geometric
graphs, even if the update rules are optimized. In this paper, we provide the first theoretical demonstration
that adding a local prediction component to the update rule can significantly improve the convergence
rate of distributed averaging algorithms. We focus on the case where the local predictor is a linear
combination of the node's current and previous values (i.e., two memory taps), and our update rule
computes a combination of the predictor and the usual weighted linear combination of values received
from neighbouring nodes. We derive the optimal mixing parameter for combining the predictor with the
neighbors' values, and conduct a theoretical analysis of the improvement in convergence rate that can be
achieved using this acceleration methodology. For a chain topology on $N$ nodes, this leads to a factor
of $N$ improvement over standard consensus, and for a two-dimensional grid, our approach achieves a
factor of $\sqrt{N}$ improvement.

Index Terms 

Distributed signal processing, average consensus, linear prediction. 
I. Introduction

Distributed algorithms for solving the average consensus problem have received considerable attention 
in the distributed signal processing and control communities recently, due to their applications in wireless 
sensor networks and distributed control of multi-agent systems [1]-[7]. See [8] for a survey. In the average
consensus problem, each node initially has a value, e.g., captured by a sensor, and the goal is to calculate 
the average of these initial values at every node in the network under the constraint that information can 
only be exchanged locally, between nodes that communicate directly. 

This paper examines the class of synchronous distributed averaging algorithms that solve the average 
consensus problem. In this framework, which can be traced back to the seminal work of Tsitsiklis (9), each 
node maintains a local estimate of the network average. In the simplest form of a distributed averaging 
algorithm, one iteration consists of having all nodes exchange values with their neighbors and then update 
their local average with a weighted linear sum of their previous estimate and the estimates received from 
their neighbors. This update can be expressed as a simple recursion of the form $\mathbf{x}(t+1) = \mathbf{W}\mathbf{x}(t)$, where
$x_i(t)$ is the estimate after $t$ iterations at node $i$, and the matrix $\mathbf{W}$ contains the weights used to perform
updates at each node. (Note that $W_{ij} \neq 0$ only if nodes $i$ and $j$ communicate directly, since information
is only exchanged locally at each iteration.) Xiao and Boyd [10] prove that, so long as the matrix $\mathbf{W}$
satisfies mild contraction conditions, the values $x_i(t)$ converge asymptotically to the initial average as
$t \to \infty$. However, Boyd et al. [11] have shown that for important network topologies, such as the
two-dimensional grid or random geometric graph, which are commonly used to model connectivity in
wireless networks, this type of distributed averaging can be prohibitively slow, even if the weight
matrix is optimized, requiring a number of iterations that grows quickly with network size.

Numerical simulations have demonstrated that predictive consensus algorithms can converge much
faster [4], [12]-[15]. These algorithms employ local node memory and modify the update so that
the state update becomes a mixture of network averaging and prediction. But there has been no
theoretical proof that they provide better performance, nor has there been any analytical characterization 
of the improvement they can provide. In addition, the algorithms have required intensive initialization 
to calculate their parameters. In this paper, we provide the first theoretical results quantifying the 
improvement obtained by predictive consensus over standard memoryless consensus algorithms. We focus 
on a linear predictor and derive a closed-form expression for the optimal mixing parameter one should 
use to combine the local prediction with the neighbourhood averaging. We analytically characterize the 
convergence rate improvement and describe a simple decentralized algorithm for initialization. 
A. Related Work

Two major approaches to accelerating the convergence of consensus algorithms can be identified:
optimizing the weight matrix [1], [3], [10], [11], and incorporating memory into the distributed averaging
algorithm [4], [12]-[16]. The spectral radius of the weight matrix governs the asymptotic convergence
rate, so optimizing the weight matrix corresponds to minimizing the spectral radius, subject to connectivity
constraints [1], [10], [11]. Xiao et al. formulate the optimization as a semi-definite problem and describe
a decentralized algorithm using distributed orthogonal iterations [1], [10], [11]. Although elegant and
efficient, this approach involves substantial initialization costs, and the improvement does not scale in
grid or random geometric graph topologies (the averaging time is improved only by a constant factor).

A more promising research direction is based on using local node memory. The idea of using
higher-order eigenvalue shaping filters was discussed in [4], but the problem of identifying optimal filter
parameters was not solved. In [12], Cao et al. proposed a memory-based acceleration framework for
gossip algorithms where updates are a weighted sum of previous state values and gossip exchanges, but
they provide no solutions or directions for weight vector design or optimization. Johansson and
Johansson [13] advocate a similar scheme for distributed consensus averaging. They investigate convergence
conditions and use standard solvers to find a numerical solution for the optimal weight vector. Recently,
polynomial filtering was introduced for consensus acceleration, with the optimal weight vector again
determined numerically [14]. Analytical solutions for the topology-dependent optimal weights have not
been considered in previous work [12]-[15] and, consequently, there has been no theoretical convergence
rate analysis for variants of distributed averaging that use memory to improve the convergence rate.

Aysal et al. proposed the mixing of neighbourhood averaging with a local linear predictor in [15].
The algorithm we analyze belongs to the general framework presented therein. Although the
algorithmic framework in [15] allows for multi-tap linear predictors, the analysis focuses entirely on one-tap
prediction. Since one-tap prediction uses only the current state value (and the output of neighbourhood
averaging), the procedure is equivalent to a modification of the memoryless consensus weight matrix. As
such, the convergence rate improvement cannot be better than that achieved by optimizing the weight
matrix as in [1], [10], [11]. Aysal et al. also present numerical simulations for acceleration involving
multi-tap predictors, which showed much greater improvement in convergence rate. However, they provided no
method to choose or initialize the algorithmic parameters, so it was impossible to implement the algorithm
in practice. There was no theoretical analysis demonstrating that the predictive acceleration procedure
could consistently outperform memoryless consensus and no characterization of the improvement.

An extreme approach to consensus acceleration is the methodology proposed in [16]. Based on the
notion of observability in linear systems, the algorithm achieves consensus in a finite number of iterations.
Each node records the entire history of values $\{x_i(t)\}_{t=0}^{T}$, and after enough iterations, inverts this
history to recover the network average. In order to carry out the inversion, each node needs to know a 
topology-dependent set of weights. This leads to complicated initialization procedures for determining 
these weights. Another drawback is that the memory required at each node grows with the network size. 

B. Summary of Contributions 

We analyze a simple, scalable and efficient framework for accelerating distributed average consensus. 
This involves the convex combination of a neighborhood averaging and a local linear prediction. We 
demonstrate theoretically that a simple two-tap linear predictor is sufficient to achieve dramatic improve- 
ments in the convergence rate. For this two-tap case, we provide an analytical solution for the optimal 
mixing parameter and characterize the achieved improvement in convergence rate. We show that the
performance gain grows with increasing network size at a rate that depends on the (expected) spectral
gap of the original weight matrix¹. As concrete examples, we show that for a chain topology on $N$
nodes, the proposed method achieves a factor of $N$ improvement over memoryless consensus, and for
a two-dimensional grid, a factor of $\sqrt{N}$ improvement, in terms of the number of iterations required
to reach a prescribed level of accuracy. We report the results of numerical experiments comparing our
proposed algorithm with standard memoryless consensus, the polynomial filter approach of [14], and
finite-time consensus [16]. The proposed algorithm converges much more rapidly than memoryless consensus,
outperforms the polynomial filtering approach of [14], and achieves performance comparable to
finite-time consensus for random geometric graph topologies. We also present a novel, efficient approach for
initialization of the accelerated algorithm. The initialization overhead is much less than that of other 
acceleration methods, rendering the scheme more practical for implementation. 

C. Paper Organization 

The remainder of this paper is structured as follows. Section II introduces the distributed average
consensus framework and outlines the linear prediction-based acceleration methodology. Section III
provides the main results, including the optimal value of the mixing parameter for the two-tap predictor,
an analysis of convergence rate and processing gain, and a practical heuristic for efficient distributed
initialization. We report the results of numerical experiments in Section IV and provide proofs of the
main results together with accompanying discussion in Section V. Section VI concludes the paper.

¹ The expectation is appropriate for families of random graphs, and is taken over the set of random graphs for a specified
number of nodes. For deterministic topologies (e.g., grid, chain) the same result applies without expectation.

II. Problem Formulation 

We assume that a network of $N$ nodes is given, and that the communication topology is specified in
terms of a collection of neighborhoods of each node: $\mathcal{N}_i \subseteq \{1, \dots, N\}$ is the set of nodes with whom
node $i$ communicates directly. For $j \in \mathcal{N}_i$, we will also say that there is an edge between $i$ and $j$, and
assume that connectivity is symmetric; i.e., $j \in \mathcal{N}_i$ implies that $i \in \mathcal{N}_j$. The cardinality of $\mathcal{N}_i$, $d_i = |\mathcal{N}_i|$,
is called the degree of node $i$. We assume that the network is connected, meaning that there is a path (a
sequence of adjacent edges) connecting every pair of nodes.

Initially, each node $i = 1, \dots, N$ has a scalar value $x_i(0) \in \mathbb{R}$, and the goal is to develop a distributed
algorithm such that every node computes $\bar{x}(0) = \frac{1}{N}\sum_{i=1}^{N} x_i(0)$. Previous studies (see, e.g., [9] or [10])
have considered linear updates of the form

$$x_i(t+1) = W_{ii}\, x_i(t) + \sum_{j \in \mathcal{N}_i} W_{ij}\, x_j(t), \tag{1}$$

where $\sum_j W_{ij} = 1$, and $W_{ij} \neq 0$ only if $j \in \mathcal{N}_i$. Stacking the values $x_1(t), \dots, x_N(t)$ into a column
vector, one network iteration of the algorithm is succinctly expressed as the linear recursion
$\mathbf{x}(t+1) = \mathbf{W}\mathbf{x}(t)$. Let $\mathbf{1}$ denote the vector of all ones. For this basic setup, Xiao and Boyd [10] have shown
that necessary and sufficient conditions on $\mathbf{W}$ which ensure convergence to the average consensus,
$\bar{\mathbf{x}}(0) = \bar{x}(0)\mathbf{1}$, are

$$\mathbf{W}\mathbf{1} = \mathbf{1}, \qquad \mathbf{1}^T\mathbf{W} = \mathbf{1}^T, \qquad \rho(\mathbf{W} - \mathbf{J}) < 1, \tag{2}$$

where $\mathbf{J}$ is the averaging matrix, $\mathbf{J} = \frac{1}{N}\mathbf{1}\mathbf{1}^T$, and $\rho(\mathbf{A})$ denotes the spectral radius of a matrix $\mathbf{A}$:

$$\rho(\mathbf{A}) \triangleq \max_i \{|\lambda_i| : i = 1, 2, \dots, N\}, \tag{3}$$

where $\{\lambda_i\}_{i=1}^{N}$ denote the eigenvalues of $\mathbf{A}$. Algorithms have been identified for locally generating
weight matrices that satisfy the required convergence conditions if the underlying graph is connected,
e.g., Maximum-degree and Metropolis-Hastings weights [1], [11].
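
As a rough illustration of the memoryless recursion $\mathbf{x}(t+1) = \mathbf{W}\mathbf{x}(t)$, the sketch below builds Metropolis-Hastings weights (using the rule $W_{ij} = 1/(1 + \max(d_i, d_j))$ quoted later in Section III-C) on a small chain and iterates the update. The helper names, the 5-node chain, and the initial values are illustrative assumptions, not part of the paper.

```python
def path_graph(n):
    """Adjacency sets of an n-node chain (hypothetical helper)."""
    return {i: {j for j in (i - 1, i + 1) if 0 <= j < n} for i in range(n)}

def metropolis_hastings_weights(neighbors):
    """W_ij = 1 / (1 + max(d_i, d_j)) on edges; W_ii makes each row sum to 1."""
    n = len(neighbors)
    W = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in neighbors[i]:
            W[i][j] = 1.0 / (1 + max(len(neighbors[i]), len(neighbors[j])))
        W[i][i] = 1.0 - sum(W[i])  # diagonal is still 0 here, so this sums the off-diagonals
    return W

def consensus_step(W, x):
    """One network iteration x(t+1) = W x(t)."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]

W = metropolis_hastings_weights(path_graph(5))
x = [1.0, 0.0, 0.0, 0.0, 4.0]      # illustrative initial values
average = sum(x) / len(x)
for _ in range(200):
    x = consensus_step(W, x)
print(average, x)
```

Every node only needs its own degree and its neighbours' degrees to form its row of $\mathbf{W}$, which is why these weights can be generated locally.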

Empirical evidence suggests that the convergence of the algorithm can be significantly improved
by using local memory [13]-[15]. The idea is to exploit smooth convergence of the algorithm, using
current and past values to predict the future trajectory. In this fashion, the algorithm achieves faster
convergence by bypassing intermediate states. Each update becomes a weighted mixture of a prediction
and a neighborhood averaging, but the mixture weights must be chosen carefully to ensure convergence.

The simplest case of local memory is two taps (a single tap is equivalent to storing only the current
value, as in standard distributed averaging), and this is the case we consider in this paper. The primary goal
of this paper is to prove that local memory can always be used to improve the convergence rate and show
that the improvement is dramatic; it is thus sufficient to examine the simplest case. For two taps of memory,
prediction at node $i$ is based on the previous state value $x_i(t-1)$, the current value $x_i(t)$, and the value
achieved by one application of the original averaging matrix, i.e., $x_i^W(t+1) = W_{ii}x_i(t) + \sum_{j \in \mathcal{N}_i} W_{ij}x_j(t)$.
The state-update equations at a node become a combination of the predictor and the value derived by
application of the consensus weight matrix (this is easily extended for predictors with longer memories;
see [13], [15]). In the two-tap memory case, we have:

$$x_i(t+1) = \alpha\, x_i^P(t+1) + (1-\alpha)\, x_i^W(t+1) \tag{4a}$$
$$x_i^W(t+1) = W_{ii}\, x_i(t) + \sum_{j \in \mathcal{N}_i} W_{ij}\, x_j(t) \tag{4b}$$
$$x_i^P(t+1) = \theta_3\, x_i^W(t+1) + \theta_2\, x_i(t) + \theta_1\, x_i(t-1). \tag{4c}$$

Here $\boldsymbol{\theta} = [\theta_1, \theta_2, \theta_3]$ is the vector of predictor coefficients.
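
The two-tap update (4a)-(4c) can be sketched in a few lines. The example below is a minimal centralized simulation, assuming a 5-node chain with weight 1/3 on every edge, predictor coefficients $\boldsymbol{\theta} = (-1/2, 0, 3/2)$, and a hand-picked mixing parameter $\alpha = 0.6$; all of these values and the helper names are illustrative choices rather than the paper's. Setting $\alpha = 0$ recovers the memoryless update, which gives a baseline for comparison.

```python
def chain_weights(n):
    """Weights on an n-node chain (n >= 3): 1/3 on every edge, rows sum to 1."""
    W = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in (i - 1, i + 1):
            if 0 <= j < n:
                W[i][j] = 1.0 / 3.0
        W[i][i] = 1.0 - sum(W[i])
    return W

def matvec(W, x):
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def accelerated_step(W, x_now, x_prev, alpha, theta):
    theta1, theta2, theta3 = theta
    x_w = matvec(W, x_now)                                   # (4b) neighbourhood averaging
    x_p = [theta3 * w + theta2 * c + theta1 * p              # (4c) linear predictor
           for w, c, p in zip(x_w, x_now, x_prev)]
    return [alpha * p + (1.0 - alpha) * w                    # (4a) mixing
            for p, w in zip(x_p, x_w)]

def iterations_to_tolerance(W, x0, alpha, theta, tol=1e-6, max_iters=10000):
    """Iterations until every node is within tol of the initial average."""
    average = sum(x0) / len(x0)
    x_prev, x_now = list(x0), list(x0)                       # convention x(-1) = x(0)
    for t in range(max_iters + 1):
        if max(abs(v - average) for v in x_now) < tol:
            return t
        x_prev, x_now = x_now, accelerated_step(W, x_now, x_prev, alpha, theta)
    return None

W = chain_weights(5)
x0 = [1.0, 0.0, 0.0, 0.0, 4.0]
theta = (-0.5, 0.0, 1.5)
standard = iterations_to_tolerance(W, x0, 0.0, theta)        # alpha = 0: plain x(t+1) = W x(t)
accelerated = iterations_to_tolerance(W, x0, 0.6, theta)
print(standard, accelerated)
```

Each node stores only one extra scalar, $x_i(t-1)$, and the communication per iteration is unchanged, since the prediction is computed locally.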

The network-wide equations can then be expressed in matrix form by defining

$$\mathbf{W}_3[\alpha] \triangleq (1 - \alpha + \alpha\theta_3)\mathbf{W} + \alpha\theta_2\mathbf{I}, \tag{5}$$
$$\mathbf{X}(t) \triangleq [\mathbf{x}(t)^T, \mathbf{x}(t-1)^T]^T, \tag{6}$$

where $\mathbf{I}$ is the identity matrix of the appropriate size, $\mathbf{X}(t)$ is the memory vector, and

$$\boldsymbol{\Phi}_3[\alpha] \triangleq \begin{bmatrix} \mathbf{W}_3[\alpha] & \alpha\theta_1\mathbf{I} \\ \mathbf{I} & \mathbf{0} \end{bmatrix}. \tag{7}$$

Each block of the above matrix has dimensions $N \times N$. We also define $\mathbf{x}(-1) = \mathbf{x}(0)$ so that
$\mathbf{X}(0) = [\mathbf{x}(0)^T, \mathbf{x}(0)^T]^T$. The update equation is then simply $\mathbf{X}(t+1) = \boldsymbol{\Phi}_3[\alpha]\mathbf{X}(t)$.

III. Main Results

This section presents the main results of the paper. Proofs and more detailed discussion are deferred
to Section V. We first present in Section III-A a discussion of how to optimize the two-tap memory
predictive consensus algorithm with respect to the network topology. The main contribution is an analytical
expression for the mixing parameter $\alpha$ that achieves the minimum limiting convergence time (a concept
defined below). This analytical expression involves only the second-largest eigenvalue of the original
weight matrix $\mathbf{W}$. In Section III-D we describe an efficient distributed algorithm for estimating the
second-largest eigenvalue. This means that there is only a relatively small overhead in initializing the
predictive consensus algorithm with a very accurate approximation to the optimal mixing parameter.

Section III-B presents an analysis of the convergence rate of the two-tap memory predictor-based
consensus algorithm when the optimal mixing parameter is used. We show how incorporating prediction
affects the spectral radius, which governs asymptotic convergence behaviour. Our result provides a bound
on how the spectral radius scales as the number of nodes in the network is increased. We discuss how this
bound can be used to develop guidelines for selecting asymptotically optimal prediction parameters $\boldsymbol{\theta}$.
The second set of results on convergence time, presented in Section III-C, characterizes a processing gain
metric. This metric measures the improvement in asymptotic convergence rate achieved by an accelerated
consensus algorithm (relative to the convergence rate achieved by standard distributed averaging using
the original weight matrix).

A. Optimal Mixing Parameter 

The mixing parameter $\alpha$ determines the influence of the standard one-step consensus iteration relative
to the predictor in (4a). We assume a foundational weight matrix, $\mathbf{W}$, has been specified, and proceed
to determine the optimal mixing parameter $\alpha$ with respect to $\mathbf{W}$. Before deriving an expression for the
optimal $\alpha$, it is necessary to specify what "optimal" means. Our goal is to minimize convergence time,
but it is important to identify how we measure convergence time.

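Xiao and Boyd [10] show that selecting weights $\mathbf{W}$ to minimize the spectral radius $\rho(\mathbf{W} - \mathbf{J})$ (while
respecting the network topology constraints) leads to the optimal convergence rate for standard distributed
averaging. In particular, the spectral radius is the worst-case asymptotic convergence rate,

$$\rho(\mathbf{W} - \mathbf{J}) = \sup_{\mathbf{x}(0) \neq \bar{\mathbf{x}}(0)} \lim_{t \to \infty} \left( \frac{\|\mathbf{x}(t) - \bar{\mathbf{x}}(0)\|}{\|\mathbf{x}(0) - \bar{\mathbf{x}}(0)\|} \right)^{1/t}. \tag{8}$$

Maximizing asymptotic convergence rate is equivalent to minimizing asymptotic convergence time,

$$\tau_{\mathrm{asym}}(\mathbf{W}) \triangleq \frac{1}{\log\left(\rho(\mathbf{W} - \mathbf{J})^{-1}\right)}, \tag{9}$$

which, asymptotically, corresponds to the number of iterations required to reduce the error $\|\mathbf{x}(t) - \bar{\mathbf{x}}(0)\|$
by a factor of $e^{-1}$ [10]. An alternative metric is the convergence time, the time required to achieve the
prescribed level of accuracy $\varepsilon$ for any non-trivial initialization [18]:

$$T_c(\mathbf{W}, \varepsilon) = \inf_{\tau \geq 0} \left\{ \tau : \|\mathbf{x}(t) - \bar{\mathbf{x}}(0)\|_2 \leq \varepsilon \|\mathbf{x}(0) - \bar{\mathbf{x}}(0)\|_2 \;\; \forall\, t \geq \tau, \; \forall\, \mathbf{x}(0) - \bar{\mathbf{x}}(0) \neq \mathbf{0} \right\}. \tag{10}$$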

In the case where $\mathbf{W}$ is symmetric, $\rho(\mathbf{W} - \mathbf{J})$ also defines the convergence time [19]. The update
matrix we propose, (7), is not symmetric and it may not even be contracting. For such matrices, the
spectral radius cannot, in general, be used to specify an upper bound on convergence time.
We can, however, establish a result for the limiting $\varepsilon$-convergence time, which is the convergence time
for asymptotically small $\varepsilon$. Specifically, in Section V-A we show that for matrices of the form (7),

$$\lim_{\varepsilon \to 0} \frac{T_c(\boldsymbol{\Phi}_3[\alpha], \varepsilon)}{\log \varepsilon^{-1}} = \frac{1}{\log \rho(\boldsymbol{\Phi}_3[\alpha] - \mathbf{J})^{-1}}. \tag{11}$$

According to this result, the convergence time required to approach the average within $\varepsilon$-accuracy grows
at the rate $1/\log \rho(\boldsymbol{\Phi}_3[\alpha] - \mathbf{J})^{-1}$ as $\varepsilon \to 0$. Minimizing the spectral radius is thus a natural optimality
criterion. The following theorem establishes the optimal setting of $\alpha$ for a given weight matrix $\mathbf{W}$, as a
function of $\lambda_2(\mathbf{W})$, the second largest eigenvalue of $\mathbf{W}$.

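Theorem 1 (Optimal mixing parameter). Suppose $\mathbf{W} \in \mathbb{R}^{N \times N}$ is a symmetric weight matrix satisfying
convergence conditions (2) and $|\lambda_N(\mathbf{W})| \leq \lambda_2(\mathbf{W})$, where the eigenvalues $\lambda_1(\mathbf{W}) = 1, \lambda_2(\mathbf{W}), \dots, \lambda_N(\mathbf{W})$
are labelled in decreasing order. Suppose further that $\theta_3 + \theta_2 + \theta_1 = 1$ and $\theta_3 \geq 1$, $\theta_2 \geq 0$. Then the
solution of the optimization problem

$$\alpha^\star = \arg\min_{\alpha} \rho(\boldsymbol{\Phi}_3[\alpha] - \mathbf{J}) \tag{12}$$

is given by the following:

$$\alpha^\star = \frac{-\left((\theta_3 - 1)\lambda_2(\mathbf{W})^2 + \theta_2\lambda_2(\mathbf{W}) + 2\theta_1\right) - 2\sqrt{\theta_1^2 + \theta_1\lambda_2(\mathbf{W})\left(\theta_2 + (\theta_3 - 1)\lambda_2(\mathbf{W})\right)}}{\left(\theta_2 + (\theta_3 - 1)\lambda_2(\mathbf{W})\right)^2}. \tag{13}$$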
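
The closed-form expression for the optimal mixing parameter is straightforward to evaluate once $\lambda_2(\mathbf{W})$ is known. The sketch below transcribes (13) and also evaluates the resulting spectral radius using the identity $\rho(\boldsymbol{\Phi}_3[\alpha^\star] - \mathbf{J}) = \sqrt{-\alpha^\star\theta_1}$ quoted in Section III-B. The function names and the sample values $\lambda_2 = 0.9$, $\boldsymbol{\theta} = (-1/2, 0, 3/2)$ are illustrative assumptions.

```python
import math

def optimal_alpha(lam2, theta):
    """Evaluate the closed form (13); assumes theta2 + (theta3 - 1) * lam2 != 0."""
    theta1, theta2, theta3 = theta
    c = theta2 + (theta3 - 1.0) * lam2
    root = math.sqrt(theta1 ** 2 + theta1 * lam2 * c)
    # lam2 * c equals (theta3 - 1) * lam2^2 + theta2 * lam2, the first bracket in (13)
    return (-(lam2 * c + 2.0 * theta1) - 2.0 * root) / c ** 2

def accelerated_spectral_radius(lam2, theta):
    """rho(Phi_3[alpha*] - J) = sqrt(-alpha* theta_1), as stated in Section III-B."""
    return math.sqrt(-optimal_alpha(lam2, theta) * theta[0])

theta = (-0.5, 0.0, 1.5)           # satisfies theta1 + theta2 + theta3 = 1, theta3 >= 1, theta2 >= 0
lam2 = 0.9                         # illustrative second-largest eigenvalue
alpha_star = optimal_alpha(lam2, theta)
rho_acc = accelerated_spectral_radius(lam2, theta)
print(alpha_star, rho_acc)
```

In this sample case the accelerated spectral radius is well below $\lambda_2 = 0.9$, which is the memoryless rate under the theorem's assumptions.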

A brief discussion of the conditions of this theorem is warranted. The conditions on the predictor
weights are technical conditions that ensure convergence is achieved. Two factors motivate our belief that
these are not overly restrictive. First, these conditions are satisfied if we employ the least-squares predictor
weights design strategy. Aysal et al. [15] describe a method for choosing the predictor coefficients based
on least-squares predictor design. For the two-tap memory case, the predictor coefficients are identified
as $\boldsymbol{\theta} = \mathbf{A}^{\dagger T}\mathbf{B}$, where

$$\mathbf{A} \triangleq \begin{bmatrix} -2 & -1 & 0 \\ 1 & 1 & 1 \end{bmatrix}^T, \tag{14}$$

$\mathbf{B} = [1, 1]^T$, and $\mathbf{A}^{\dagger}$ is the Moore-Penrose pseudoinverse of $\mathbf{A}$. This choice of predictor coefficients
satisfies the technical conditions on $\boldsymbol{\theta}$ in Theorem 1 above ($\theta_1 + \theta_2 + \theta_3 = 1$ and $\theta_3 \geq 1$, $\theta_2 \geq 0$).
Second, in Section III-B we show that the choice of weights does not have a significant effect on the
convergence properties, and asymptotically optimal weights also satisfy the conditions on $\boldsymbol{\theta}$ in Theorem 1.
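
The least-squares design discussed above can be checked by hand or numerically. The sketch below assumes that $\mathbf{A}$ has rows $(-2, 1)$, $(-1, 1)$, $(0, 1)$ (time indices of the three taps plus an intercept column) and $\mathbf{B} = [1, 1]^T$ (extrapolation one step ahead); it then forms $\boldsymbol{\theta} = \mathbf{A}(\mathbf{A}^T\mathbf{A})^{-1}\mathbf{B}$, which equals $\mathbf{A}^{\dagger T}\mathbf{B}$ when $\mathbf{A}$ has full column rank. The function name is hypothetical.

```python
def least_squares_predictor():
    """theta = A (A^T A)^{-1} B for the assumed 3x2 matrix A and B = [1, 1]^T."""
    A = [[-2.0, 1.0], [-1.0, 1.0], [0.0, 1.0]]   # assumed reading of (14)
    B = [1.0, 1.0]
    # Gram matrix G = A^T A (2x2, symmetric)
    g11 = sum(r[0] * r[0] for r in A)
    g12 = sum(r[0] * r[1] for r in A)
    g22 = sum(r[1] * r[1] for r in A)
    det = g11 * g22 - g12 * g12
    # y = G^{-1} B via the explicit 2x2 inverse
    y1 = (g22 * B[0] - g12 * B[1]) / det
    y2 = (-g12 * B[0] + g11 * B[1]) / det
    # theta = A y, ordered as (theta1, theta2, theta3)
    return [r[0] * y1 + r[1] * y2 for r in A]

theta = least_squares_predictor()
print(theta)
```

Under these assumptions the coefficients come out as $(-2/3, 1/3, 4/3)$, which sum to one and satisfy $\theta_3 \geq 1$, $\theta_2 \geq 0$, in line with the claim in the text.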

The condition on the weight matrix, $|\lambda_N(\mathbf{W})| \leq \lambda_2(\mathbf{W})$, significantly reduces the complexity of
the proof. Most distributed algorithms for constructing weight matrices (e.g., Metropolis-Hastings (MH)
or max-degree) lead to $\mathbf{W}$ that satisfy the condition, but they are not guaranteed to do so. We can
ensure that the condition is satisfied by applying a completely local adjustment to any weight matrix.
The mapping $\mathbf{W} \mapsto \frac{1}{2}(\mathbf{I} + \mathbf{W})$ transforms any stochastic matrix $\mathbf{W}$ into a stochastic matrix with all
positive eigenvalues [11]; this mapping can be carried out locally, without any knowledge of the global
properties of $\mathbf{W}$, and without affecting the order-wise asymptotic convergence rate as $N \to \infty$.

B. Convergence Rate Analysis 

We begin with our main result for the convergence rate of two-tap predictor-based accelerated
consensus. Theorem 2 indicates how the spectral radius of the accelerated operator $\boldsymbol{\Phi}_3[\alpha]$ is related to the
spectral radius of the foundational weight matrix $\mathbf{W}$. Since the limiting $\varepsilon$-convergence time is governed
by the spectral radius, this relationship characterizes the improvement in convergence rate.

Theorem 2 (Convergence rate). Suppose the assumptions of Theorem 1 hold. Suppose further that the
original matrix $\mathbf{W}$ satisfies $\rho(\mathbf{W} - \mathbf{J}) \leq 1 - \Psi(N)$ for some function $\Psi : \mathbb{N} \to (0, 1)$ of the network
size $N$. Then the matrix $\boldsymbol{\Phi}_3[\alpha^\star]$ satisfies $\rho(\boldsymbol{\Phi}_3[\alpha^\star] - \mathbf{J}) \leq 1 - \sqrt{\Psi(N)}$.

In order to explore how fast the spectral radius, $\rho(\boldsymbol{\Phi}_3[\alpha^\star] - \mathbf{J}) = \sqrt{-\alpha^\star\theta_1}$ (see Section V-C for
details), goes to one as $N \to \infty$, we can take its asymptotic Taylor series expansion:

$$\rho(\boldsymbol{\Phi}_3[\alpha^\star] - \mathbf{J}) = 1 - \sqrt{\frac{2(\theta_3 - 1) + \theta_2}{\theta_3 - 1 + \theta_2}}\,\sqrt{\Psi(N)} + O(\Psi(N)). \tag{15}$$

From this expression, we see that the bound presented in Theorem 2 correctly captures the convergence
rate of the accelerated consensus algorithm. Alternatively, leaving only two terms in the expansion above,

$$\rho(\boldsymbol{\Phi}_3[\alpha^\star] - \mathbf{J}) = 1 - \Theta\left(\sqrt{\Psi(N)}\right), \tag{16}$$

we see that the bound presented is rate optimal in Landau notation.

We can also use (15) to provide guidelines for choosing asymptotically optimal prediction parameters
$\theta_3$ and $\theta_2$. In particular, it is clear that the coefficient $\gamma(\theta_2, \theta_3) = \sqrt{[2(\theta_3 - 1) + \theta_2]/[\theta_3 - 1 + \theta_2]}$
should be maximized to minimize the spectral radius $\rho(\boldsymbol{\Phi}_3[\alpha^\star] - \mathbf{J})$. It is straightforward to verify
that setting $\theta_2 = 0$ and $\theta_3 = 1 + \epsilon$ for any $\epsilon > 0$ satisfies the assumptions of Theorem 1 and also
satisfies $\gamma(0, 1 + \epsilon) \geq \gamma(\theta_2, 1 + \epsilon)$ for any positive $\theta_2$. Since $\gamma(0, 1 + \epsilon) = \sqrt{2}$ is independent of $\epsilon$ (or
$\theta_3$), we conclude that setting $(\theta_1, \theta_2, \theta_3) = (-\epsilon, 0, 1 + \epsilon)$ satisfies the assumptions of Theorem 1 and
asymptotically yields the optimal limiting $\varepsilon$-convergence time for the proposed approach, as $N \to \infty$.

February 5, 2010 DRAFT

C. Processing Gain Analysis

Next, we investigate the gain that can be obtained by using the accelerated algorithm presented in this
paper. We consider the ratio $\tau_{\mathrm{asym}}(\mathbf{W})/\tau_{\mathrm{asym}}(\boldsymbol{\Phi}_3[\alpha^\star])$ of the asymptotic convergence time of the standard
consensus algorithm using weight matrix $\mathbf{W}$ and the asymptotic convergence time of the proposed
accelerated algorithm. This ratio shows how many times fewer iterations, asymptotically, the optimized
predictor-based algorithm must perform to reduce error by a factor of $e^{-1}$.

If the network topology is modeled as random (e.g., a sample from the family of random geometric
graphs), we adopt the expected gain $\mathcal{G}(\mathbf{W}) = E\{\tau_{\mathrm{asym}}(\mathbf{W})/\tau_{\mathrm{asym}}(\boldsymbol{\Phi}_3[\alpha^\star])\}$ as a performance metric,
where $\boldsymbol{\Phi}_3[\alpha^\star]$ is implicitly constructed using the same matrix $\mathbf{W}$. The expected gain characterizes the
average improvement obtained by running the algorithm over many realizations of the network topology.
In this case the spectral radius, $\rho(\mathbf{W} - \mathbf{J})$, is considered to be a random variable dependent on the
particular realization of the graph. Consequently, the expectations in the following theorem are taken
with respect to the measure induced by the random nature of the graph.

Theorem 3 (Expected gain). Suppose the assumptions of Theorem 1 hold. Suppose further that the
original matrix $\mathbf{W}$ satisfies $E\{\rho(\mathbf{W} - \mathbf{J})\} = 1 - \Psi(N)$ for some function $\Psi : \mathbb{N} \to (0, 1)$ of the network
size $N$. Then $\mathcal{G}(\mathbf{W}) = \Omega\left(1/\sqrt{\Psi(N)}\right)$.

We note that there is no loss of generality in considering the expected gain since, in the case of a
deterministic network topology, these results still hold (without expectations) because they are based
on the deterministic derivations in Theorems 1 and 2.

For a chain graph (path of $N$ vertices) the eigenvalues of the Metropolis-Hastings (MH) weight matrix,
$\mathbf{W}_{\mathrm{MH}}$, constructed according to [1] ($W_{ij} = 1/(1 + \max(d_i, d_j))$ if $j \in \mathcal{N}_i$, $i \neq j$; $W_{ij} = 0$ if $j \notin \mathcal{N}_i$;
and $W_{ii} = 1 - \sum_{j \in \mathcal{N}_i} W_{ij}$) are given by $\lambda_i(\mathbf{W}_{\mathrm{MH}}) = 1/3 + (2/3)\cos(\pi(i - 1)/N)$, $i = 1, 2, \dots, N$.
This is straightforward to verify using Theorem 5 in [20]. For the path graph, the weight matrix $\mathbf{W}_{\mathrm{MH}}$ is
tridiagonal and we have $\max(d_i, d_j) = 2$ for every edge. Thus, in this case, $\rho(\mathbf{W}_{\mathrm{MH}} - \mathbf{J}) = 1/3 + (2/3)\cos(\pi/N)$.
For large enough $N$ this results in $\rho(\mathbf{W}_{\mathrm{MH}} - \mathbf{J}) = 1 - \frac{\pi^2}{3N^2} + O(1/N^4)$. Using the same sequence of
steps used to prove Theorem 3 above without taking expectations, we see that for the chain topology, the
improvement in asymptotic convergence rate is asymptotically lower bounded by $N$; i.e., $\mathcal{G}(\mathbf{W}) = \Omega(N)$.
Similarly, for a network with two-dimensional grid topology, taking $\mathbf{W}$ to be the transition matrix for
a natural random walk on the grid (a minor perturbation of the MH weights) it is known [21] that
$(1 - \lambda_2(\mathbf{W}))^{-1} = \Theta(N)$. Thus, for a two-dimensional grid, the proposed algorithm leads to a gain of
$\mathcal{G}(\mathbf{W}) = \Omega(N^{1/2})$.
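
The linear growth of the gain on the chain can be observed numerically. The sketch below uses the chain eigenvalue $\lambda_2 = 1/3 + (2/3)\cos(\pi/N)$, the optimal mixing parameter of Theorem 1 with $\boldsymbol{\theta} = (-\epsilon, 0, 1 + \epsilon)$, and the asymptotic convergence time (9). The function names and the choice $\epsilon = 1/2$ are illustrative assumptions.

```python
import math

def chain_lambda2(n):
    """Second-largest eigenvalue of the MH weight matrix on an n-node chain."""
    return 1.0 / 3.0 + (2.0 / 3.0) * math.cos(math.pi / n)

def asymptotic_time(rho):
    """tau_asym = 1 / log(1 / rho), as in (9)."""
    return 1.0 / math.log(1.0 / rho)

def chain_gain(n, eps=0.5):
    """tau_asym(W) / tau_asym(Phi_3[alpha*]) for theta = (-eps, 0, 1 + eps)."""
    lam2 = chain_lambda2(n)
    theta1 = -eps
    c = eps * lam2                               # theta2 + (theta3 - 1) * lam2 with theta2 = 0
    alpha = (-(lam2 * c + 2.0 * theta1)
             - 2.0 * math.sqrt(theta1 ** 2 + theta1 * lam2 * c)) / c ** 2
    rho_acc = math.sqrt(-alpha * theta1)         # accelerated spectral radius
    return asymptotic_time(lam2) / asymptotic_time(rho_acc)

gains = {n: chain_gain(n) for n in (10, 20, 40, 80)}
for n, g in gains.items():
    print(n, round(g, 2))
```

Doubling $N$ roughly doubles the printed gain, consistent with $\mathcal{G}(\mathbf{W}) = \Omega(N)$ for the chain.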

This discussion suggests that the following result may also be useful in characterizing the improvement
in asymptotic convergence rate obtained by using the proposed algorithm.

Corollary 1. Suppose that the assumptions of Theorem 3 hold and suppose in addition that
$\rho(\mathbf{W} - \mathbf{J}) = 1 - \Theta(1/N^{\beta})$. Then the improvement in asymptotic convergence rate attained by the
accelerated algorithm is $\mathcal{G}(\mathbf{W}) = \Omega(N^{\beta/2})$.

D. Initialization Heuristic: Decentralized Estimation of $\lambda_2(\mathbf{W})$

Under our assumptions, the optimal value of the mixing parameter depends only on the values of the
predictor coefficients and the second largest eigenvalue of the initial matrix $\mathbf{W}$. In this section we discuss
a decentralized procedure for estimating $\lambda_2(\mathbf{W})$. Since we assume the predictor weights, $\boldsymbol{\theta}$, and weight
matrix $\mathbf{W}$ are fixed and specified, this is the only parameter that remains to be identified for a fully
decentralized implementation of the algorithm. Estimation of $\lambda_2(\mathbf{W})$ is a straightforward exercise if we
employ the method of decentralized orthogonal iterations (DOI) proposed for distributed spectral analysis
in [22] and refined for distributed optimization applications in [11].

Algorithm 1 presents the proposed specialized and streamlined version of DOI, which is only used to
calculate the second largest eigenvalue of the consensus update matrix $\mathbf{W}$. Our underlying assumptions
in Algorithm 1 are those of Theorem 1, in which case we have $\lambda_2(\mathbf{W}) = \rho(\mathbf{W} - \mathbf{J})$. The eigenvalue
shifting technique discussed after Theorem 1 can be employed whenever the assumption $|\lambda_N(\mathbf{W})| \leq \lambda_2(\mathbf{W})$
does not hold. The main idea of DOI is to repeatedly apply $\mathbf{W}$ to a random vector $\mathbf{v}_0$, with periodic
normalization and subtraction of the estimate of the mean, until $\mathbf{v}_K = \mathbf{W}^K\mathbf{v}_0$ converges to the eigenvector
associated with the second-largest eigenvalue of $\mathbf{W}$. The second-largest eigenvalue is then estimated by
calculating $\|\mathbf{W}\mathbf{v}_K\|/\|\mathbf{v}_K\|$ for a valid norm $\|\cdot\|$. Previous algorithms for DOI [11], [22] have normalized in step 6 by the
$\ell_2$ norm of $\mathbf{v}_k$, estimated by $K$ iterations of consensus, and step 9 previously required an additional
$K$ iterations to calculate $\|\mathbf{W}\mathbf{v}_K\|_2$ and $\|\mathbf{v}_K\|_2$. In addition, because the initial random vectors in [11],
[22] are not zero-mean, these algorithms must apply additional consensus operations to eliminate the
bias (otherwise $\mathbf{v}_K$ converges to $\mathbf{1}$). Previous algorithms thus have $O(K^2)$ complexity, where $K$ is the
topology-dependent number of consensus iterations needed to achieve accurate convergence to the average
value. For example, for a random geometric graph, one typically needs $K \propto N$.

Algorithm 1: Spectral radius estimation (Input: foundational weight matrix $\mathbf{W}$)

1  Choose random vector $\mathbf{v}$;
2  Set $\mathbf{v}_0 = \mathbf{W}\mathbf{v} - \mathbf{v}$;  // Generate zero-mean random vector
3  for $k = 1$ to $K$ do
4      $\mathbf{v}_k = \mathbf{W}\mathbf{v}_{k-1}$;  // Apply W to converge to second-largest eigenvector
5      if $k \bmod L = 0$ then
6          $\mathbf{v}_k = \mathbf{v}_k / \|\mathbf{v}_k\|_\infty$;  // Normalize by supremum norm every L iterations
7      endif
8  endfor
9  Let $\hat{\lambda}_2(\mathbf{W}) = \|\mathbf{W}\mathbf{v}_K\|_\infty / \|\mathbf{v}_K\|_\infty$;

The main innovations of Algorithm 1 are in line 2, which ensures that the initial random vector is zero
mean, in line 6, where normalization is done (after every $L$ applications of the consensus update) using
the supremum norm, and line 9, where the supremum norm is also used in lieu of the $\ell_2$ norm² (based on
Gelfand's formula [23] we have $\lim_{K \to \infty} \|\mathbf{W}\mathbf{v}_K\|_\infty / \|\mathbf{v}_K\|_\infty = \rho(\mathbf{W} - \mathbf{J})$). The maximum entry of the
vector $\mathbf{v}_k$ can be calculated using a maximum consensus algorithm, wherein every node updates its value
with the maximum of its immediate neighbours: $x_i(t) = \max_{j \in \mathcal{N}_i} x_j(t - 1)$. Maximum consensus requires
at most $N$ iterations to converge for any topology; more precisely it requires a number of iterations equal
to the diameter, $D$, of the underlying graph, which is often much less than $N$ (and much less than $K$).
Equally importantly, maximum consensus achieves perfect agreement. In the algorithms of [11], [22]
each node normalizes by a slightly different value (there are residual errors in the consensus procedure).
In Algorithm 1, all nodes normalize by the same value, and this leads to much better estimation accuracy.
Taken together, these innovations lead to an algorithm that is only $O(K)$ (with the appropriate choice
of $L$). In particular, the complexity of Algorithm 1 is clearly $O(K + DK/L + D)$. Choosing $L \propto D$
(assuming that $\lambda_2(\mathbf{W})^D \gg \Delta$, where $\Delta$ is machine precision) we obtain an $O(K)$ algorithm. The
proposed initialization algorithm has significantly smaller computation/communication complexity than
the initialization algorithm proposed for the distributed computation of the optimal matrix in [11].
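
The steps of Algorithm 1 can be sketched as a centralized simulation. In the example below, the supremum norm is computed directly with `max`, standing in for the maximum consensus that a real network would run; the chain weights, the values $K = 400$ and $L = 10$, and all helper names are illustrative assumptions. The estimate is compared against the known chain eigenvalue $1/3 + (2/3)\cos(\pi/N)$.

```python
import math
import random

def chain_weights(n):
    """MH weights on an n-node chain (n >= 3): 1/3 on every edge."""
    W = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in (i - 1, i + 1):
            if 0 <= j < n:
                W[i][j] = 1.0 / 3.0
        W[i][i] = 1.0 - sum(W[i])
    return W

def matvec(W, x):
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def sup_norm(x):
    # In a network this value would be obtained by maximum consensus.
    return max(abs(v) for v in x)

def estimate_lambda2(W, K, L, seed=0):
    rng = random.Random(seed)
    v = [rng.random() for _ in range(len(W))]            # line 1: random vector
    vk = [a - b for a, b in zip(matvec(W, v), v)]        # line 2: Wv - v has zero mean
    for k in range(1, K + 1):                            # line 3
        vk = matvec(W, vk)                               # line 4
        if k % L == 0:                                   # line 5
            scale = sup_norm(vk)                         # line 6: periodic normalization
            vk = [a / scale for a in vk]
    return sup_norm(matvec(W, vk)) / sup_norm(vk)        # line 9

n = 10
estimate = estimate_lambda2(chain_weights(n), K=400, L=10)
exact = 1.0 / 3.0 + (2.0 / 3.0) * math.cos(math.pi / n)
print(estimate, exact)
```

Because $\mathbf{1}^T\mathbf{W} = \mathbf{1}^T$, the vector $\mathbf{W}\mathbf{v} - \mathbf{v}$ sums to zero, so the iteration is not drawn toward the all-ones eigenvector and no extra averaging step is needed to remove a bias.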

IV. Numerical Experiments and Discussion 

This section presents simulation results for two scenarios. In the first simulation scenario, network
topologies are drawn from the family of random geometric graphs of $N$ nodes [24]. In this model, $N$
nodes are randomly assigned coordinates in the unit square, and links exist between nodes that are at most
a distance $\sqrt{2 \log N / N}$ apart. (This scaling law for the connectivity radius guarantees the network is connected
with high probability [24].) Two models for the initial node measurements, $x(0)$, are considered. In the
"Slope" model, the initial value $x_i(0)$ at node $i$ is just the sum of its coordinates in the unit square. In the
"Spike" model, all nodes are initialized to 0, except for one randomly chosen node whose initial value is
set to one. All simulation results are generated based on 300 trials (a different random graph and node
initialization is generated for each trial). The initial values are normalized so that the initial variance of
node values is equal to 1. The second simulation scenario is for the $N$-node chain topology. Intuitively,
this network configuration constitutes one of the most challenging topologies for distributed averaging
algorithms since the chain has the longest diameter and weakest connectivity of all graphs on $N$ nodes. For
this topology, we adopt analogous versions of the "Slope" and "Spike" initializations to those described
above; for the "Slope", $x_i(0) = i/N$, and for the "Spike", we average over all locations of the one.

2 We have not observed any penalty for using the $\ell_\infty$ norm in our experiments. This observation is supported by the theoretical
equivalence of $\ell_p$ norms in the consensus framework [18].
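
The random geometric graph scenario described above can be sketched as follows. This is a minimal illustration under stated assumptions: the function names are hypothetical, `log` is taken to be the natural logarithm, and "normalized so that the initial variance is equal to 1" is interpreted as scaling by the population standard deviation without shifting the values.

```python
import math
import random


def random_geometric_graph(n, rng):
    """Place n nodes uniformly in the unit square; link pairs within the radius."""
    radius = math.sqrt(2.0 * math.log(n) / n)
    coords = [(rng.random(), rng.random()) for _ in range(n)]
    edges = [(i, j) for i in range(n) for j in range(i + 1, n)
             if math.dist(coords[i], coords[j]) <= radius]
    return coords, edges, radius


def variance(x):
    mean = sum(x) / len(x)
    return sum((xi - mean) ** 2 for xi in x) / len(x)


def unit_variance(x):
    """Scale x so that its population variance equals one (assumed normalization)."""
    scale = math.sqrt(variance(x))
    return [xi / scale for xi in x]


def slope_init(coords):
    """'Slope' model: each node starts at the sum of its two coordinates."""
    return unit_variance([cx + cy for cx, cy in coords])


def spike_init(n, rng):
    """'Spike' model: all zeros except a single randomly chosen node set to one."""
    x = [0.0] * n
    x[rng.randrange(n)] = 1.0
    return unit_variance(x)


rng = random.Random(0)  # fixed seed so the sketch is repeatable
coords, edges, radius = random_geometric_graph(200, rng)
x_slope = slope_init(coords)
x_spike = spike_init(200, rng)
```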

We run the algorithm $N$ times with different initializations of the eigenvalue estimation algorithm to
investigate the effects of initializing $\alpha^*$ with an imperfect estimate of $\lambda_2(W)$. In simulations involving
the calculation of convergence time we have fixed the required accuracy of computations, $\varepsilon$, at the level
$-100$ dB (i.e., a relative error of $1 \times 10^{-5}$). For predictor parameters, we use $(\theta_1, \theta_2, \theta_3) = (-\epsilon, 0, 1 + \epsilon)$,
$\epsilon = 1/2$, as these were shown to be asymptotically optimal in Section III-B.
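
To make the effect of these parameter choices concrete, the sketch below compares memoryless consensus with the two-tap accelerated recursion on a small chain. It is an illustration under assumptions rather than the code used for the reported experiments: the function names are hypothetical, the chain uses weight 1/3 on every edge, the recursion is written as $x(t) = W_3[\alpha]x(t-1) + \alpha\theta_1 x(t-2)$ with $x(-1) = x(0)$ assumed, and the mixing parameter is taken from the closed form (44) evaluated at the exact $\lambda_2(W)$.

```python
import math


def chain_weights(n):
    """Symmetric weights for an n-node chain: 1/3 on every edge (n >= 3)."""
    w = [[0.0] * n for _ in range(n)]
    for i in range(n - 1):
        w[i][i + 1] = w[i + 1][i] = 1.0 / 3.0
    for i in range(n):
        w[i][i] = 1.0 - sum(w[i])
    return w


def matvec(w, v):
    return [sum(a * b for a, b in zip(row, v)) for row in w]


def optimal_alpha(lam2, theta):
    """Closed form (44) evaluated at lambda_2(W)."""
    t1, t2, t3 = theta
    c = t2 + (t3 - 1.0) * lam2
    root = math.sqrt(t1 * t1 + t1 * lam2 * c)
    return (-(lam2 * c + 2.0 * t1) - 2.0 * root) / (c * c)


def relative_error(x, x0):
    mean = sum(x0) / len(x0)
    num = math.sqrt(sum((xi - mean) ** 2 for xi in x))
    den = math.sqrt(sum((xi - mean) ** 2 for xi in x0))
    return num / den


def memoryless(w, x0, n_iters):
    x = list(x0)
    for _ in range(n_iters):
        x = matvec(w, x)
    return x


def accelerated(w, x0, alpha, theta, n_iters):
    """x(t) = W3[alpha] x(t-1) + alpha*theta1*x(t-2); x(-1) = x(0) is assumed."""
    t1, t2, t3 = theta
    prev, cur = list(x0), list(x0)
    for _ in range(n_iters):
        wx = matvec(w, cur)
        nxt = [(1.0 - alpha + alpha * t3) * a + alpha * t2 * b + alpha * t1 * c
               for a, b, c in zip(wx, cur, prev)]
        prev, cur = cur, nxt
    return cur


n, theta = 20, (-0.5, 0.0, 1.5)
w = chain_weights(n)
lam2 = 1.0 - (2.0 / 3.0) * (1.0 - math.cos(math.pi / n))  # exact for this chain
x0 = [i / n for i in range(n)]  # chain version of the "Slope" initialization
err_plain = relative_error(memoryless(w, x0, 200), x0)
err_accel = relative_error(accelerated(w, x0, optimal_alpha(lam2, theta), theta, 200), x0)
```

After the same number of iterations, `err_accel` should be several orders of magnitude below `err_plain` on this topology.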

We compare our algorithm with two memoryless approaches, the Metropolis-Hastings (MH) weight
matrix, and the optimal weight matrix of Xiao and Boyd [10]. MH weights are attractive because they
can be calculated by each node simply using knowledge of its own degree and its neighbors' degrees.
We also compare to two approaches from the literature that also make use of memory at each node to
improve the rate of convergence: polynomial filtering [14], and finite-time consensus [16].
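
The locality of the MH weights can be illustrated with a short sketch. The exact weight rule is not restated in this section, so the code below uses one common form, $W_{ij} = 1/(1 + \max(d_i, d_j))$ for neighbouring nodes, as an assumption; the function name and the adjacency-list input format are also hypothetical.

```python
def metropolis_hastings_weights(neighbours):
    """neighbours[i] lists the nodes adjacent to node i (undirected, no self-loops)."""
    n = len(neighbours)
    degree = [len(nb) for nb in neighbours]
    w = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in neighbours[i]:
            # each edge weight depends only on the two endpoint degrees
            w[i][j] = 1.0 / (1.0 + max(degree[i], degree[j]))
        w[i][i] = 1.0 - sum(w[i])  # remaining mass stays at the node itself
    return w


# Star graph: node 0 is connected to nodes 1, 2 and 3.
star = [[1, 2, 3], [0], [0], [0]]
w_star = metropolis_hastings_weights(star)
```

The resulting matrix is symmetric with unit row sums, so it is doubly stochastic, which is the property the later proofs rely on.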

We first plot the MSE decay curves as a function of the number of consensus iterations $t$ for network size
$N = 200$, RGG topology and different initializations. Figure 1 compares the performance of the proposed
algorithm with the algorithms using the MH or the optimal weight matrix of Xiao and Boyd [10]. It can
be seen that our decentralized initialization scheme does not have a major influence on the performance
of our approach, as the method initialized using a decentralized estimate for $\lambda_2(W)$ (the curve labelled
MH-ProposedEst) and the method initialized using precise knowledge of $\lambda_2(W)$ (labelled MH-Proposed)
coincide nearly exactly, since the procedure discussed in Section III-D provides a good estimate of $\lambda_2(W)$
(to within $10^{-3}$ maximum relative error for a 200-node RGG). It is also clear that the proposed algorithm
outperforms both the memoryless MH matrix and the optimal weight matrix of Xiao and Boyd [10]. In
this experiment we fixed $K = 2N$ and $L = 10$. Note that the results in Figure 1 and all subsequent
figures do not account for initialization costs. The initialization cost is relatively small. For the 200-node
RGG it is equal to about $3N = 600$ consensus iterations (if we bound the diameter of the 200-node
RGG by 20). If we desire a relative error of $10^{-3}$, our algorithm gains approximately 70 iterations over
memoryless MH consensus, based on Fig. 2(b). For this desired accuracy, the initialization overhead is
thus recovered after fewer than 10 consensus operations.

3 To determine the optimal weight matrix and optimal polynomial filter weights we used CVX, a package for specifying and
solving convex programs [25].

Fig. 1. MSE vs. iterations for 200-node random geometric graphs. The algorithms compared are: optimal weight matrix of
Xiao and Boyd [10] (Opt): +; MH weights (MH): △; proposed method with oracle $\lambda_2(W)$ and MH matrix (MH-Proposed): ○;
proposed with decentralized estimate of $\lambda_2(W)$ (MH-ProposedEst): ×; accelerated consensus, with oracle $\lambda_2(W)$ and optimal
matrix (Opt-Proposed): □. (a) Slope initialization, (b) Spike initialization.

Fig. 2. MSE vs. iteration for 200-node topologies, Slope initialization. The algorithms compared are: optimal weight matrix
of Xiao and Boyd [10] (Opt): +; polynomial filter with 3 taps (MH-PolyFilt3): ▽ and 7 taps (MH-PolyFilt7): ▷; proposed
method with oracle $\lambda_2(W)$ and MH matrix (MH-Proposed): ○; proposed method with decentralized estimate of $\lambda_2(W)$
(MH-ProposedEst): ×.

Figure 2 compares the MSE curves for the proposed algorithm with two versions of polynomial filtering
consensus [14], one using 3 taps and the other using 7 taps. We see that in the RGG scenario, our
algorithm outperforms polynomial filtering with 3 memory taps and converges at a rate similar to that of
the 7-tap version of polynomial filtering (see footnote 4). Decentralized calculation of topology-adapted polynomial filter
weights also remains an open problem. We conclude that for random geometric graphs, our algorithm
has superior properties with respect to polynomial filtering since it has better error performance for the
same computational complexity, and our approach is suitable for completely distributed implementation.
Moving our attention to the chain topology only emphasizes these points, as our accelerated algorithm
significantly outperforms even 7-tap polynomial filtering. Note that decentralized initialization of our
algorithm also works well in the chain graph scenario. However, to obtain this result we have to increase
the number of consensus iterations in the eigenvalue estimation algorithm, $K$, from $2N$ to $N^2$. This
increase in the complexity of the distributed optimization of the accelerated consensus algorithm is due to
the properties of the power methods [26] and related eigenvalue estimation problems. The accuracy of

4 Calculating optimal weights in the polynomial filtering framework quickly becomes ill-conditioned with increasing filter
length, and we were not able to obtain stable results for more than 7 taps on random geometric graph topologies. Note that
the original paper [14] also focuses on filters of length no more than 7. We conjecture that this ill-conditioning stems from the
fact that the optimal solution involves pseudo-inversion of a Vandermonde matrix containing powers of the original eigenvalues.
Since, for random geometric graph topologies, eigenvalues are not described by a regular function (e.g., the cosine, as for the
chain graph) there is a relatively high probability (increasing with $N$) that the original weight matrix contains two similar-valued
eigenvalues which may result in the Vandermonde matrix being ill-conditioned.



February 5, 2010 



DRAFT 






[Figure 3 plots, panels (a) and (b): horizontal axis $N$ (50 to 200); curves MH-PolyFilt3, Opt, MH-PolyFilt7, MH-Proposed.]

Fig. 3. Averaging time characterization, random geometric graph topologies. The algorithms compared are: optimal weight
matrix of Xiao and Boyd [10] (Opt): +; polynomial filter with 3 taps (MH-PolyFilt3): ▽ and 7 taps (MH-PolyFilt7): ▷; proposed
method with oracle $\lambda_2(W)$ and MH matrix (MH-Proposed): ○; proposed method with MH matrix and decentralized estimate
of $\lambda_2(W)$ (MH-ProposedEst): ×. (a) Averaging time as a function of the network size, (b) Ratio of the averaging time of the
non-accelerated algorithm to that of the associated accelerated algorithm.



the second largest eigenvalue computation depends on the ratio $\lambda_3(W)/\lambda_2(W)$, and this ratio increases
much more rapidly for the chain topology as $N$ grows than it does for random geometric graphs.

To investigate the robustness and scalability properties of the proposed algorithm, we next examine
the averaging time, $T_{\mathrm{ave}}(\Phi_3[\alpha^*])$, as defined in (10), and the ratio $T_{\mathrm{ave}}(W)/T_{\mathrm{ave}}(\Phi_3[\alpha^*])$, for random
geometric graphs (Fig. 3) and the chain topology (Fig. 4). We establish through simulation that the scaling
behaviour of the ratio that can be measured experimentally matches very well with the asymptotic result
established theoretically for the processing gain, $\tau_{\mathrm{asym}}(W)/\tau_{\mathrm{asym}}(\Phi_3[\alpha^*])$. We see from Fig. 3 that in the
random geometric graph setting, the proposed algorithm always outperforms consensus with the optimal
weight matrix of Xiao and Boyd [10] and the polynomial filter with an equal number of memory taps, and our
approach scales comparably to 7-tap polynomial filtering. On the other hand, in the chain graph setting
(Fig. 4) the proposed algorithm outperforms all the competing algorithms. Another interesting observation
from Fig. 4 is that the gains of the polynomial filter and optimal weight matrix remain almost constant
with varying network size while the gain obtained by the proposed algorithm increases significantly with
$N$. This linear improvement with $N$ matches well with the asymptotic behavior predicted by Theorem 3.

Fig. 4. Averaging time characterization, chain topology. The algorithms compared are: optimal weight matrix of Xiao and
Boyd [10] (Opt): +; polynomial filter with 3 taps (MH-PolyFilt3): ▽ and 7 taps (MH-PolyFilt7): ▷; proposed method with
oracle $\lambda_2(W)$ and MH matrix (MH-Proposed): ○. (a) Averaging time as a function of the network size, (b) Improvement due
to the accelerated consensus: ratio of the averaging time of the non-accelerated algorithm to that of the associated accelerated
algorithm.

Finally, we compare the proposed algorithm with the linear observer approach of Sundaram and
Hadjicostis [16], which works by remembering all of the consensus values, $x_i(t)$, seen at a node $i$
(unbounded memory). After enough updates, each node is able to perfectly recover the average by
locally solving a set of linear equations. To compare the method of [16] with our approach and the other
asymptotic approaches described above, we determine the topology-dependent number of iterations that
the linear-observer method must execute to have enough information to exactly recover the average. We
then run each of the asymptotic approaches for the same number of iterations and evaluate performance
based on the MSE they achieve. Figure 5 depicts results for both random geometric graph and chain
topologies. For random geometric graphs of $N > 100$ nodes, we observe that the proposed algorithm
achieves an error of at most $10^{-12}$ (roughly machine precision) by the time the linear observer approach
has sufficient information to compute the average. For the chain topology the results are much more
favourable for the linear-observer approach. However, the linear observer approach requires significant
overhead to determine the topology-dependent coefficients that define the linear system to be solved at
each node and does not scale well to large networks.

Fig. 5. MSE at the point when finite time consensus of Sundaram and Hadjicostis [16] has enough information to calculate
the exact average at all nodes. The algorithms compared are: optimal weights (Opt): +; polynomial filter with 3 taps
(MH-PolyFilt3): ▽ and 7 taps (MH-PolyFilt7): ▷; proposed method with oracle $\lambda_2(W)$ and MH matrix (MH-Proposed): ○. (a)
Random geometric graph, (b) Chain topology.



V. Proofs of Main Results and Discussion 
A. Limiting $\varepsilon$-convergence time

To begin, we need to motivate choosing $\alpha$ to minimize the spectral radius $\rho(\Phi_3[\alpha] - J)$ since, unlike
in the memoryless setting, it does not bound the step-wise rate of convergence. In fact, since $\Phi_3[\alpha]$ is
not symmetric, $\Phi_3[\alpha]^t$ does not even converge to $J$ as $t \to \infty$, in contrast to the memoryless setting. However,
we will show that: (i) for the proposed construction, $\Phi_3[\alpha]^t$ does converge to a matrix $\bar{\Phi}$; (ii) that the
limiting convergence time is governed by $\rho(\Phi_3[\alpha] - \bar{\Phi})$; and (iii) that $\rho(\Phi_3[\alpha] - \bar{\Phi}) = \rho(\Phi_3[\alpha] - J)$.

Before stating our first result we must introduce some notation. For now, assume we are given a matrix
$\Phi \in \mathbb{R}^{n \times n}$ with $\bar{\Phi} = \lim_{t\to\infty} \Phi^t$. We will address conditions for existence of the limit below. For a given
initialization vector $x(0) \in \mathbb{R}^n$, let $\bar{x}(0) = \bar{\Phi}x(0)$, and define the set of non-trivial initialization vectors
$\mathcal{X}_{0,\Phi} = \{x(0) \in \mathbb{R}^n : x(0) \neq \bar{x}(0)\}$. Since we have not yet established that $\bar{x}(0) = \bar{\Phi}x(0) = Jx(0)$, we
keep the discussion general and use the following definition of the convergence time:

$$T_c(\Phi,\varepsilon) = \inf_{\tau \ge 0}\left\{\tau : \|x(t) - \bar{x}(0)\|_2 \le \varepsilon\,\|x(0) - \bar{x}(0)\|_2 \ \ \forall t \ge \tau,\ \forall x(0) \in \mathcal{X}_{0,\Phi}\right\}. \tag{16}$$

We now prove a result relating the spectral radius and the $\varepsilon$-convergence time for general non-symmetric
averaging matrices $\Phi$, which we will then apply to our particular construction, $\Phi_3[\alpha]$.

Theorem 4. Let $\Phi \in \mathbb{R}^{n \times n}$ be given, with limit $\lim_{t\to\infty}\Phi^t = \bar{\Phi}$, and assume that $\rho(\Phi - \bar{\Phi}) > 0$. Then

$$\lim_{\varepsilon \to 0} \frac{T_c(\Phi,\varepsilon)}{\log \varepsilon^{-1}} = \frac{1}{\log \rho(\Phi - \bar{\Phi})^{-1}}. \tag{17}$$

Proof: The limit $\lim_{t\to\infty}\Phi^t = \bar{\Phi}$ exists if and only if (see [27]) $\Phi$ can be expressed in the form

$$\Phi = T \begin{bmatrix} I_\kappa & 0 \\ 0 & Z \end{bmatrix} T^{-1}, \tag{18}$$

where $I_\kappa$ is the identity matrix of dimension $\kappa$, $Z$ is a matrix with $\rho(Z) < 1$ and $T$ is an invertible
matrix. It follows that in the limit we have [15],

$$\bar{\Phi} = \lim_{t\to\infty}\Phi^t = T \begin{bmatrix} I_\kappa & 0 \\ 0 & 0 \end{bmatrix} T^{-1}. \tag{19}$$



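By linear algebra, $\Phi\bar{\Phi} = \bar{\Phi}\Phi = \bar{\Phi}$ and $\bar{\Phi}^t = \bar{\Phi}$. Using these facts it is trivial to show $(\Phi - \bar{\Phi})^t = \Phi^t - \bar{\Phi}$,
implying $(\Phi - \bar{\Phi})^t(x(0) - \bar{x}(0)) = x(t) - \bar{x}(0)$. Taking the norm of both sides we have

$$\|x(t) - \bar{x}(0)\|_2 = \|(\Phi - \bar{\Phi})^t (x(0) - \bar{x}(0))\|_2 \tag{20}$$

and therefore

$$T_c(\Phi,\varepsilon) = \inf_{\tau \ge 0}\left\{\tau : \frac{\|(\Phi - \bar{\Phi})^t (x(0) - \bar{x}(0))\|_2}{\|x(0) - \bar{x}(0)\|_2} \le \varepsilon \ \ \forall t \ge \tau,\ \forall x(0) \in \mathcal{X}_{0,\Phi}\right\}. \tag{21}$$

By the definition of $T_c(\Phi,\varepsilon)$ above we have:

$$\frac{\|(\Phi - \bar{\Phi})^{T_c(\Phi,\varepsilon)} (x(0) - \bar{x}(0))\|_2}{\|x(0) - \bar{x}(0)\|_2} \le \varepsilon, \quad \forall x(0) \in \mathcal{X}_{0,\Phi}. \tag{22}$$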



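This implies:

$$\left[\left(\sup_{x(0) \in \mathcal{X}_{0,\Phi}} \frac{\|(\Phi - \bar{\Phi})^{T_c(\Phi,\varepsilon)} (x(0) - \bar{x}(0))\|_2}{\|x(0) - \bar{x}(0)\|_2}\right)^{1/T_c(\Phi,\varepsilon)}\right]^{T_c(\Phi,\varepsilon)} \le \varepsilon \tag{23}$$

and so, using the definition of the induced operator norm, which is simply
$\|\Phi - \bar{\Phi}\|_2 = \sup_{x(0) \in \mathcal{X}_{0,\Phi}} \|(\Phi - \bar{\Phi})(x(0) - \bar{x}(0))\|_2 / \|x(0) - \bar{x}(0)\|_2$, after taking the logarithm on both sides of (23), we have (see footnote 5)

$$T_c(\Phi,\varepsilon) \ge \frac{\log \varepsilon}{\log \|(\Phi - \bar{\Phi})^{T_c(\Phi,\varepsilon)}\|_2^{1/T_c(\Phi,\varepsilon)}}. \tag{24}$$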



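5 Since we are interested in asymptotic behaviour of the type $\varepsilon \to 0$, there is no loss of generality in supposing
that $\varepsilon$ is sufficiently small so that the following holds: $\log \varepsilon < 0$, $\log \|(\Phi - \bar{\Phi})^{T_c(\Phi,\varepsilon)}\|_2^{1/T_c(\Phi,\varepsilon)} < 0$, and
$\log \|(\Phi - \bar{\Phi})^{T_c(\Phi,\varepsilon)-1}\|_2^{1/(T_c(\Phi,\varepsilon)-1)} < 0$.

Since [23] $\rho(\Phi - \bar{\Phi}) \le \|(\Phi - \bar{\Phi})^t\|_2^{1/t}$ for any $t > 0$, it follows that

$$T_c(\Phi,\varepsilon) \ge \frac{\log \varepsilon}{\log \rho(\Phi - \bar{\Phi})}, \tag{25}$$

from which it is also clear that $T_c(\Phi,\varepsilon) \to \infty$ as $\varepsilon \to 0$.
Now, by the definition of $T_c(\Phi,\varepsilon)$ in (21) we also have

$$\exists\, x(0) \in \mathcal{X}_{0,\Phi} : \ \frac{\|(\Phi - \bar{\Phi})^{T_c(\Phi,\varepsilon)-1} (x(0) - \bar{x}(0))\|_2}{\|x(0) - \bar{x}(0)\|_2} > \varepsilon, \tag{26}$$

implying, for the operator norm of $(\Phi - \bar{\Phi})^{T_c(\Phi,\varepsilon)-1}$:

$$\|(\Phi - \bar{\Phi})^{T_c(\Phi,\varepsilon)-1}\|_2 > \varepsilon. \tag{27}$$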

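From (27) and (23) it follows that $\|(\Phi - \bar{\Phi})^{T_c(\Phi,\varepsilon)}\|_2 \le \varepsilon < \|(\Phi - \bar{\Phi})^{T_c(\Phi,\varepsilon)-1}\|_2$, and thus we can always
pick $\beta \in [0, 1)$ such that the following holds:

$$\beta\,\|(\Phi - \bar{\Phi})^{T_c(\Phi,\varepsilon)-1}\|_2 + (1 - \beta)\,\|(\Phi - \bar{\Phi})^{T_c(\Phi,\varepsilon)}\|_2 = \varepsilon. \tag{28}$$

Using the notation $C_{T_c} = \|(\Phi - \bar{\Phi})^{T_c(\Phi,\varepsilon)}\|_2 / \|(\Phi - \bar{\Phi})^{T_c(\Phi,\varepsilon)-1}\|_2$ for the bounded number $C_{T_c}$ we
conclude

$$\|(\Phi - \bar{\Phi})^{T_c(\Phi,\varepsilon)-1}\|_2\,\bigl(\beta + (1 - \beta)C_{T_c}\bigr) = \varepsilon. \tag{29}$$

The boundedness of $C_{T_c}$ follows from the sub-multiplicativity of the operator norm,
$\|(\Phi - \bar{\Phi})^{T_c(\Phi,\varepsilon)}\|_2 \le \|\Phi - \bar{\Phi}\|_2\,\|(\Phi - \bar{\Phi})^{T_c(\Phi,\varepsilon)-1}\|_2$, yielding $0 \le C_{T_c} \le \|\Phi - \bar{\Phi}\|_2$.
Using the technique used to switch from (23) to (24) we obtain from (29):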



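$$(T_c(\Phi,\varepsilon) - 1)\,\log \|(\Phi - \bar{\Phi})^{T_c(\Phi,\varepsilon)-1}\|_2^{1/(T_c(\Phi,\varepsilon)-1)} = \log \varepsilon - \log\bigl(\beta + (1 - \beta)C_{T_c}\bigr). \tag{30}$$

Dividing through by $\log \varepsilon^{-1}\,\log \|(\Phi - \bar{\Phi})^{T_c(\Phi,\varepsilon)-1}\|_2^{1/(T_c(\Phi,\varepsilon)-1)}$ and taking the limit as $\varepsilon \to 0$ we have

$$\lim_{\varepsilon\to 0}\frac{T_c(\Phi,\varepsilon)}{\log \varepsilon^{-1}} = \lim_{\varepsilon\to 0}\frac{-1}{\log \|(\Phi - \bar{\Phi})^{T_c(\Phi,\varepsilon)-1}\|_2^{1/(T_c(\Phi,\varepsilon)-1)}} - \lim_{\varepsilon\to 0}\frac{\log\bigl(\beta + (1 - \beta)C_{T_c}\bigr)}{\log \|(\Phi - \bar{\Phi})^{T_c(\Phi,\varepsilon)-1}\|_2^{1/(T_c(\Phi,\varepsilon)-1)}\,\log \varepsilon^{-1}} + \lim_{\varepsilon\to 0}\frac{1}{\log \varepsilon^{-1}}. \tag{31}$$

Moving the limits on the right under the logs and using the fact that $T_c(\Phi,\varepsilon) \to \infty$ as $\varepsilon \to 0$, we may
employ Gelfand's formula [23], $\lim_{t\to\infty} \|(\Phi - \bar{\Phi})^t\|^{1/t} = \rho(\Phi - \bar{\Phi})$:

$$\lim_{\varepsilon\to 0}\frac{T_c(\Phi,\varepsilon)}{\log \varepsilon^{-1}} = \frac{-1}{\log \lim_{\varepsilon\to 0}\|(\Phi - \bar{\Phi})^{T_c(\Phi,\varepsilon)-1}\|_2^{1/(T_c(\Phi,\varepsilon)-1)}} - \lim_{\varepsilon\to 0}\frac{\log\bigl(\beta + (1 - \beta)C_{T_c}\bigr)}{\log \lim_{\varepsilon\to 0}\|(\Phi - \bar{\Phi})^{T_c(\Phi,\varepsilon)-1}\|_2^{1/(T_c(\Phi,\varepsilon)-1)}\,\log \varepsilon^{-1}}$$
$$= \frac{1}{\log \rho(\Phi - \bar{\Phi})^{-1}} + \lim_{\varepsilon\to 0}\frac{\log\bigl(\beta + (1 - \beta)C_{T_c}\bigr)}{\log \rho(\Phi - \bar{\Phi})^{-1}\,\log \varepsilon^{-1}}. \tag{32}$$

Using $C_{T_c} \le \|\Phi - \bar{\Phi}\|$, $\log(\beta + (1 - \beta)C_{T_c}) \le |\log(\beta + (1 - \beta)\|\Phi - \bar{\Phi}\|)|$, and taking into account the
fact that $0 \le |\log(\beta + (1 - \beta)\|\Phi - \bar{\Phi}\|)| \le |\log \|\Phi - \bar{\Phi}\||$, $\forall \beta \in [0,1]$, we have

$$\lim_{\varepsilon\to 0}\frac{T_c(\Phi,\varepsilon)}{\log \varepsilon^{-1}} \le \frac{1}{\log \rho(\Phi - \bar{\Phi})^{-1}} + \lim_{\varepsilon\to 0}\frac{|\log \|\Phi - \bar{\Phi}\||}{\log \rho(\Phi - \bar{\Phi})^{-1}\,\log \varepsilon^{-1}} = \frac{1}{\log \rho(\Phi - \bar{\Phi})^{-1}}. \tag{33}$$

Combining the last inequality with (25) completes the proof. ■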
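
The limit in Theorem 4 can be illustrated on the smallest possible example. The sketch below is only an illustration under assumptions: it uses a symmetric two-node averaging matrix with off-diagonal weight $a$ (a hypothetical example, not a case treated in the paper), for which every deviation from the average shrinks by exactly $\rho = |1 - 2a|$ per iteration, so the convergence time found by simulation from a single starting vector coincides with $T_c(\Phi,\varepsilon)$.

```python
import math


def two_node_convergence_time(a, eps):
    """Iterations until the relative error of two-node averaging drops to eps."""
    x = [1.0, 0.0]
    mean = 0.5
    err0 = math.hypot(x[0] - mean, x[1] - mean)
    tau = 0
    # the error decreases monotonically, so the first crossing is the convergence time
    while math.hypot(x[0] - mean, x[1] - mean) > eps * err0:
        x = [(1 - a) * x[0] + a * x[1], a * x[0] + (1 - a) * x[1]]
        tau += 1
    return tau


a = 0.25                       # spectral radius of W - J is |1 - 2a| = 0.5
rho = abs(1.0 - 2.0 * a)
limit = 1.0 / math.log(1.0 / rho)
ratios = {eps: two_node_convergence_time(a, eps) / math.log(1.0 / eps)
          for eps in (1e-3, 1e-6, 1e-12)}
```

Each entry of `ratios` is the measured $T_c/\log\varepsilon^{-1}$, which stays close to `limit` and approaches it as $\varepsilon$ shrinks, up to the rounding of the iteration count to an integer.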
In order to apply the above result, we must establish that $\Phi_3[\alpha]$ satisfies the conditions of Theorem 4.
In doing so, we will also show that (i) for $\Phi = \Phi_3[\alpha]$ and $X(0)$ defined in (6), the limit $\bar{\Phi}X(0) = JX(0)$,
so our approach indeed converges to the average consensus, and (ii) that the limiting convergence time
is characterized by a function of $\rho(\Phi_3[\alpha] - J)$, which motivates choosing $\alpha$ to optimize this expression.
(Recall, in this setting $J$ is the $2N \times 2N$ matrix with all entries equal to $1/(2N)$.) Note that the condition
on $\alpha$ is necessary for $\Phi_3[\alpha]^t$ to have a limit as $t \to \infty$, as will be established in Section V-B.

Proposition 1. Let $\Phi_3[\alpha]$ be defined as above, assume that the assumptions of Theorem 1 hold, and
$\alpha \in [0, -\theta_1^{-1})$. Then:

(a) $\bar{\Phi}_3[\alpha] = \lim_{t\to\infty} \Phi_3[\alpha]^t$ exists, with $\bar{\Phi}_3[\alpha]X(0) = JX(0)$ for all $X(0)$ defined in (6),

(b) $\rho(\Phi_3[\alpha] - \bar{\Phi}_3[\alpha]) > 0$, and

(c) $\lim_{\varepsilon\to 0} \dfrac{T_c(\Phi_3[\alpha],\varepsilon)}{\log \varepsilon^{-1}} = \dfrac{1}{\log \rho(\Phi_3[\alpha] - J)^{-1}}$.

Proof: Proof of part (a). In Theorem 1 in [15], Johansson and Johansson show that the necessary and
sufficient conditions for the consensus algorithm of the form $\Phi_3[\alpha]$ to converge to the average are (JJ1)
$\Phi_3[\alpha]\mathbf{1} = \mathbf{1}$; (JJ2) $g^T\Phi_3[\alpha] = g^T$ for vector $g^T = [\beta_1\mathbf{1}^T \ \beta_2\mathbf{1}^T]$ with weights satisfying $\beta_1 + \beta_2 = 1$; and
(JJ3) $\rho(\Phi_3[\alpha] - \frac{1}{N}\mathbf{1}g^T) < 1$. If these conditions hold then we also have $\bar{\Phi}_3[\alpha] = \frac{1}{N}\mathbf{1}g^T$ [13], implying
$\bar{\Phi}_3[\alpha]X(0) = JX(0)$. Condition (JJ1) is easily verified after straightforward algebraic manipulations using the
definition of $\Phi_3[\alpha]$, the assumption that $\theta_1 + \theta_2 + \theta_3 = 1$, and recalling that $W$ satisfies $W\mathbf{1} = \mathbf{1}$
by design. To address condition (JJ2), we set $\beta_1 = 1/(1 + \alpha\theta_1)$ and $\beta_2 = \alpha\theta_1/(1 + \alpha\theta_1)$. Clearly,
$\beta_1 + \beta_2 = 1$, and it is also easy to verify condition (JJ2) by plugging these values into the definition of
$g$, and using the same properties of $\Phi_3[\alpha]$, the $\theta_i$'s, and $W$ as above.
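
Conditions (JJ1) and (JJ2) can also be checked numerically. The sketch below is an illustration under assumptions: it builds $\Phi_3[\alpha]$ in the block form that appears in (34), with $W_3[\alpha] = (1 - \alpha + \alpha\theta_3)W + \alpha\theta_2 I$, for a small chain with weight 1/3 on every edge; the function names and the particular values of $\alpha$ and $\theta$ are hypothetical choices.

```python
def chain_weights(n):
    """Symmetric weights for an n-node chain: 1/3 on every edge (n >= 3)."""
    w = [[0.0] * n for _ in range(n)]
    for i in range(n - 1):
        w[i][i + 1] = w[i + 1][i] = 1.0 / 3.0
    for i in range(n):
        w[i][i] = 1.0 - sum(w[i])
    return w


def build_phi3(w, alpha, theta):
    """Phi3 = [[W3, alpha*theta1*I], [I, 0]] with W3 = (1-alpha+alpha*theta3) W + alpha*theta2 I."""
    t1, t2, t3 = theta
    n = len(w)
    phi = [[0.0] * (2 * n) for _ in range(2 * n)]
    for i in range(n):
        for j in range(n):
            phi[i][j] = (1.0 - alpha + alpha * t3) * w[i][j] + (alpha * t2 if i == j else 0.0)
        phi[i][n + i] = alpha * t1  # upper-right block
        phi[n + i][i] = 1.0         # lower-left identity block
    return phi


def jj_residuals(w, alpha, theta):
    """Largest violations of Phi3 1 = 1 (JJ1) and g^T Phi3 = g^T (JJ2)."""
    t1 = theta[0]
    n = len(w)
    phi = build_phi3(w, alpha, theta)
    beta1 = 1.0 / (1.0 + alpha * t1)
    beta2 = alpha * t1 / (1.0 + alpha * t1)
    g = [beta1] * n + [beta2] * n
    g_phi = [sum(g[i] * phi[i][j] for i in range(2 * n)) for j in range(2 * n)]
    jj1 = max(abs(sum(row) - 1.0) for row in phi)
    jj2 = max(abs(a - b) for a, b in zip(g_phi, g))
    return jj1, jj2


jj1, jj2 = jj_residuals(chain_weights(5), alpha=0.8, theta=(-0.5, 0.0, 1.5))
```

Both residuals should vanish up to floating-point rounding whenever $\theta_1 + \theta_2 + \theta_3 = 1$ and $W$ is doubly stochastic.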

In order to verify that condition (JJ3) holds, we will show here that $\rho(\Phi_3[\alpha] - \frac{1}{N}\mathbf{1}g^T) = \rho(\Phi_3[\alpha] - J)$.
In Section V-B we show that $\rho(\Phi_3[\alpha] - J) < 1$ if $\alpha \in [0, -\theta_1^{-1})$, and thus condition (JJ3) is also satisfied
under the assumptions of the proposition. To show that $\rho(\Phi_3[\alpha] - \frac{1}{N}\mathbf{1}g^T) = \rho(\Phi_3[\alpha] - J)$, we prove
a stronger result, namely that $\Phi_3[\alpha] - \frac{1}{N}\mathbf{1}g^T$ and $\Phi_3[\alpha] - J$ have the same eigenspectra. Consider
the eigenvector $v_i$ of $\Phi_3[\alpha]$ with corresponding eigenvalue $\lambda_i(\Phi_3[\alpha])$. This pair solves the eigenvalue
problem, $\Phi_3[\alpha]v_i = \lambda_i(\Phi_3[\alpha])v_i$. Equivalently, expanding the definition of $\Phi_3[\alpha]$, we have

$$\begin{bmatrix} W_3[\alpha] & \alpha\theta_1 I \\ I & 0 \end{bmatrix} v_i = \lambda_i(\Phi_3[\alpha]) \begin{bmatrix} I & 0 \\ 0 & I \end{bmatrix} v_i. \tag{34}$$



We observe that (34) fits a modification of the first companion form of the linearization of a Quadratic
Eigenvalue Problem (QEP) (see Section 3.4 in [28]). The QEP has general form $(\lambda^2 M + \lambda C + K)u = 0$,
where $u$ is the eigenvector associated with this QEP. The linearization of interest to us has the form:

$$\begin{bmatrix} -C & -K \\ I & 0 \end{bmatrix} \begin{bmatrix} \lambda u \\ u \end{bmatrix} - \lambda \begin{bmatrix} M & 0 \\ 0 & I \end{bmatrix} \begin{bmatrix} \lambda u \\ u \end{bmatrix} = 0. \tag{35}$$

The correspondence is clear if we make the associations: $M = I$, $C = -W_3[\alpha]$ and $K = -\alpha\theta_1 I$,
$\lambda = \lambda_i(\Phi_3[\alpha])$ and $v_i = [\lambda_i(\Phi_3[\alpha])u_i^T \ u_i^T]^T$. Eigenvectors $v_i$ that solve (34) thus have special structure
and are related to $u_i$, the solution to the QEP,

$$\left(\lambda_i(\Phi_3[\alpha])^2 I - \lambda_i(\Phi_3[\alpha])W_3[\alpha] - \alpha\theta_1 I\right)u_i = 0. \tag{36}$$

Because the first and third terms above are scaled identity matrices and the definition of $W_3[\alpha]$ (see (5))
also involves scaled identity matrices, we can simplify this last equation to find that any solution $u_i$ must
also be an eigenvector of $W$.

We have seen above, when verifying condition (JJ1), that $\mathbf{1}$ is an eigenvector of $\Phi_3[\alpha]$ with
corresponding eigenvalue $\lambda_1(\Phi_3[\alpha]) = 1$. Observe that, from the definition of $g$ and because $\beta_1 + \beta_2 = 1$,
we have $(\frac{1}{N}\mathbf{1}g^T)\mathbf{1} = \mathbf{1}$. Thus, $(\Phi_3[\alpha] - \frac{1}{N}\mathbf{1}g^T)\mathbf{1} = 0$. Similarly, recalling that $J = \frac{1}{2N}\mathbf{1}\mathbf{1}^T$, we have
$J\mathbf{1} = \mathbf{1}$, and thus $(\Phi_3[\alpha] - J)\mathbf{1} = 0$. By design, $W$ is a doubly stochastic matrix, and all eigenvectors
$u$ of $W$ with $u \neq \mathbf{1}$ are orthogonal to $\mathbf{1}$. It follows that $(\frac{1}{N}\mathbf{1}g^T)v_i = 0$ for corresponding eigenvectors
$v_i = [\lambda_i(\Phi_3[\alpha])u_i^T \ u_i^T]^T$ of $\Phi_3[\alpha]$, and thus $(\Phi_3[\alpha] - \frac{1}{N}\mathbf{1}g^T)v_i = \Phi_3[\alpha]v_i = \lambda_i(\Phi_3[\alpha])v_i$. Similarly,
$Jv_i = 0$ if $v_i \neq \mathbf{1}$, and $(\Phi_3[\alpha] - J)v_i = \lambda_i(\Phi_3[\alpha])v_i$. Therefore, we conclude that the matrices
$(\Phi_3[\alpha] - \bar{\Phi}_3[\alpha])$ and $(\Phi_3[\alpha] - J)$ have identical eigenspectra, and thus $\rho(\Phi_3[\alpha] - \frac{1}{N}\mathbf{1}g^T) = \rho(\Phi_3[\alpha] - J)$.

In Section V-B we show that $\rho(\Phi_3[\alpha] - J) < 1$ if $\alpha \in [0, -\theta_1^{-1})$, and thus the assumptions of the
proposition, taken together with the analysis just conducted, verify that condition (JJ3) is also satisfied.
Therefore, the limit $\lim_{t\to\infty} \Phi_3[\alpha]^t = \bar{\Phi}_3[\alpha] = \frac{1}{N}\mathbf{1}g^T$ exists, and $\bar{\Phi}_3[\alpha]X(0) = JX(0)$ for all $X(0)$
defined in (6).

Proofs of parts (b) and (c). In the proof of Lemma 1 (see Section V-B), it is shown that $\rho(\Phi_3[\alpha] - J) \ge -\alpha\theta_1$.
Thus, if $\alpha > 0$ and $\theta_1 < 0$, then part (b) holds. The assumptions $\theta_1 + \theta_2 + \theta_3 = 1$, $\theta_3 \ge 1$, and
$\theta_2 \ge 0$ imply that $\theta_1 \le 0$, and by assumption, $\alpha \ge 0$. If $\alpha = 0$ or $\theta_1 = 0$, then the proposed predictive
consensus scheme reduces to memoryless consensus with weight matrix $W$ (and the statement follows
directly from the results of [10], [11]). Thus, part (b) of the proposition follows from the assumptions
and the analysis in Lemma 1 below. By proving parts (a) and (b), we have verified the assumptions of
Theorem 4 above. Applying the result of this Theorem, together with the equivalence of $\rho(\Phi_3[\alpha] - \frac{1}{N}\mathbf{1}g^T)$
and $\rho(\Phi_3[\alpha] - J)$, gives the claim in part (c), thereby completing the proof. ■

B. Proof of Theorem 1: Optimal Mixing Parameter

In order to minimize the spectral radius of $\Phi_3[\alpha]$ we need to know its eigenvalues. These can be
calculated by solving the eigenvalue problem (34). We can multiply (36) by $u_i^T$ on the left to obtain a
quadratic equation that links the individual eigenvalues $\lambda_i(\Phi_3[\alpha])$ and $\lambda_i(W_3[\alpha])$:

$$u_i^T\left(\lambda_i(\Phi_3[\alpha])^2 I - \lambda_i(\Phi_3[\alpha])W_3[\alpha] - \alpha\theta_1 I\right)u_i = \lambda_i(\Phi_3[\alpha])^2 - \lambda_i(W_3[\alpha])\lambda_i(\Phi_3[\alpha]) - \alpha\theta_1 = 0. \tag{37}$$

Recall $\Phi_3[\alpha]$ is a $2N \times 2N$ matrix, and so $\Phi_3[\alpha]$ has, in general, $2N$ eigenvalues, twice as many as
$W_3[\alpha]$. These eigenvalues are the solutions of the quadratic (37), and are given by

$$\lambda_i^*(\Phi_3[\alpha]) = \frac{1}{2}\left(\lambda_i(W_3[\alpha]) + \sqrt{\lambda_i(W_3[\alpha])^2 + 4\alpha\theta_1}\right),$$
$$\lambda_i^{**}(\Phi_3[\alpha]) = \frac{1}{2}\left(\lambda_i(W_3[\alpha]) - \sqrt{\lambda_i(W_3[\alpha])^2 + 4\alpha\theta_1}\right). \tag{38}$$
With these expressions for the eigenvalues of $\Phi_3[\alpha]$, we are in a position to formulate the problem of
minimizing the spectral radius of the matrix $(\Phi_3[\alpha] - J)$, $\alpha^* = \arg\min_{\alpha} \rho(\Phi_3[\alpha] - J)$. It can be shown
that this problem is equivalent to

$$\alpha^* = \arg\min_{\alpha \ge 0} \rho(\Phi_3[\alpha] - J). \tag{39}$$

The simplest way to demonstrate this is to show that $\rho(\Phi_3[\alpha] - J) \ge \rho(\Phi_3[0] - J)$ for any $\alpha < 0$. Indeed,
by the definition of the spectral radius we have that $\rho(\Phi_3[\alpha] - J) \ge \lambda_2^*(\Phi_3[\alpha])$ and $\rho(\Phi_3[0] - J) = \lambda_2(W)$.
The latter is clear if we plug $\alpha = 0$ into (38). Hence it is enough to demonstrate $\lambda_2^*(\Phi_3[\alpha]) \ge \lambda_2(W)$.
Consider the inequality $\lambda_2^*(\Phi_3[\alpha]) - \lambda_2(W) \ge 0$. Replacing $\lambda_2^*(\Phi_3[\alpha])$ with its definition according to
(38), rearranging terms and squaring both sides gives $\alpha\theta_1 \ge \lambda_2(W)^2 - \lambda_2(W)\lambda_2(W_3[\alpha])$. From the
definition of $W_3[\alpha]$ in (5), it follows that $\lambda_2(W_3[\alpha]) = (1 - \alpha + \alpha\theta_3)\lambda_2(W) + \alpha\theta_2$. Using this relation
leads to the expression $\alpha(\theta_1 + (\theta_3 - 1)\lambda_2(W)^2 + \theta_2\lambda_2(W)) \ge 0$. Under our assumptions, we have
$\theta_3 - 1 \ge 0$, $\theta_2 \ge 0$ and $\theta_1 \le 0$. Thus $\theta_1 + (\theta_3 - 1)\lambda_2(W)^2 + \theta_2\lambda_2(W) \le \theta_1 + \theta_3 - 1 + \theta_2 = 0$ since
$\lambda_2(W) < 1$. This implies that if $\alpha < 0$, the last inequality holds, leading to $\lambda_2^*(\Phi_3[\alpha]) \ge \lambda_2(W)$. Thus
for any $\alpha < 0$ the spectral radius $\rho(\Phi_3[\alpha] - J)$ cannot decrease, and so we may focus on optimizing
over $\alpha \ge 0$.

Now, the proof of Theorem 1 boils down to examining how varying $\alpha$ affects the eigenvalues of
$\Phi_3[\alpha]$ on a case-by-case basis. We first show that the first eigenvalues, $\lambda_1^*(\Phi_3[\alpha])$ and $\lambda_1^{**}(\Phi_3[\alpha])$, are
smaller than all the others. Then, we demonstrate that the second eigenvalues, $\lambda_2^*(\Phi_3[\alpha])$ and $\lambda_2^{**}(\Phi_3[\alpha])$,
dominate all other pairs, $\lambda_j^*(\Phi_3[\alpha])$ and $\lambda_j^{**}(\Phi_3[\alpha])$, for $j > 2$, allowing us to focus on the second
eigenvalues, from which the proof follows. Along the way, we establish conditions on $\alpha$ which guarantee
stability of the proposed two-tap predictive consensus methodology.

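To begin, we reformulate the optimization problem in terms of the eigenvalues of $\Phi_3[\alpha]$. We first
consider $\lambda_1^*(\Phi_3[\alpha])$ and $\lambda_1^{**}(\Phi_3[\alpha])$. Substituting $\lambda_1(W_3[\alpha]) = (1 - \alpha + \alpha\theta_3) + \alpha\theta_2$ we obtain the
relationship $\sqrt{\lambda_1(W_3[\alpha])^2 + 4\alpha\theta_1} = |1 + \alpha\theta_1|$ and using the condition $\theta_1 \le 0$, we conclude that

$$\lambda_1^*(\Phi_3[\alpha]),\ \lambda_1^{**}(\Phi_3[\alpha]) = \begin{cases} 1,\ -\alpha\theta_1 & \text{if } 1 + \alpha\theta_1 \ge 0 \ \Rightarrow\ \alpha \le -\theta_1^{-1}, \\ -\alpha\theta_1,\ 1 & \text{if } 1 + \alpha\theta_1 < 0 \ \Rightarrow\ \alpha > -\theta_1^{-1}. \end{cases} \tag{40}$$

We note that $\alpha > -\theta_1^{-1}$ implies $|\lambda_1^*(\Phi_3[\alpha])| > 1$, leading to divergence of the linear recursion involving
$\Phi_3[\alpha]$, and thus conclude that the potential solution is restricted to the range $\alpha \le -\theta_1^{-1}$. Focusing on
this setting, we write $\lambda_1^*(\Phi_3[\alpha]) = 1$ and $\lambda_1^{**}(\Phi_3[\alpha]) = -\alpha\theta_1$. We can now reformulate the problem
(39) in terms of the eigenvalues of $\Phi_3[\alpha]$:

$$\alpha^* = \arg\min_{\alpha \ge 0}\ \max_{i = 1, 2, \ldots, N} J_i[\alpha, \lambda_i(W)] \tag{41}$$

where

$$J_i[\alpha, \lambda_i(W)] = \begin{cases} |\lambda_1^{**}(\Phi_3[\alpha])|, & i = 1, \\ \max\left(|\lambda_i^*(\Phi_3[\alpha])|, |\lambda_i^{**}(\Phi_3[\alpha])|\right), & i > 1. \end{cases} \tag{42}$$

We now state a lemma that characterizes the functions $J_i[\alpha, \lambda_i(W)]$.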

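Lemma 1. Under the assumptions of Theorem 1,

$$J_i[\alpha, \lambda_i(W)] = \begin{cases} \alpha^{1/2}(-\theta_1)^{1/2} & \text{if } \alpha \in [\alpha_i^*, -\theta_1^{-1}], \\ \frac{1}{2}\left(|\lambda_i(W_3[\alpha])| + \sqrt{\lambda_i(W_3[\alpha])^2 + 4\alpha\theta_1}\right) & \text{if } \alpha \in [0, \alpha_i^*), \end{cases} \tag{43}$$

where

$$\alpha_i^* = \frac{-\left((\theta_3 - 1)\lambda_i(W)^2 + \theta_2\lambda_i(W) + 2\theta_1\right) - 2\sqrt{\theta_1^2 + \theta_1\lambda_i(W)\left(\theta_2 + (\theta_3 - 1)\lambda_i(W)\right)}}{\left(\theta_2 + (\theta_3 - 1)\lambda_i(W)\right)^2}. \tag{44}$$

Over the range $\alpha \in [0, -\theta_1^{-1}]$, $J_i[\alpha, \lambda_i(W)] \ge J_1[\alpha, \lambda_1(W)]$ for $i = 2, 3, \ldots, N$.

Proof: For $i = 2, 3, \ldots, N$, the eigenvalues $\lambda_i^*(\Phi_3[\alpha])$ and $\lambda_i^{**}(\Phi_3[\alpha])$ can admit two distinct forms;
when the expression under the square root in (38) is less than zero, the respective eigenvalues are complex,
and when this expression is positive, the eigenvalues are real. In the region where the eigenvalues are
complex (with $\mathrm{i}^2 = -1$),

$$\max\left(|\lambda_i^*(\Phi_3[\alpha])|, |\lambda_i^{**}(\Phi_3[\alpha])|\right) = \frac{1}{2}\left[\lambda_i(W_3[\alpha])^2 + \mathrm{i}^2\left(\sqrt{\lambda_i(W_3[\alpha])^2 + 4\alpha\theta_1}\right)^2\right]^{1/2} = \alpha^{1/2}(-\theta_1)^{1/2}. \tag{45}$$

We note that (45) is a strictly increasing function of $\alpha$. Recalling that $\lambda_i(W_3[\alpha]) = (1 + \alpha(\theta_3 - 1))\lambda_i(W) + \alpha\theta_2$
and solving the quadratic $\lambda_i(W_3[\alpha])^2 + 4\alpha\theta_1 = 0$, we can identify the region, $[\alpha_i^*, \alpha_i^{**}]$,
where the eigenvalues are complex. The upper boundary of this region is

$$\alpha_i^{**} = \frac{-\left((\theta_3 - 1)\lambda_i(W)^2 + \theta_2\lambda_i(W) + 2\theta_1\right) + 2\sqrt{\theta_1^2 + \theta_1\lambda_i(W)\left(\theta_2 + (\theta_3 - 1)\lambda_i(W)\right)}}{\left(\theta_2 + (\theta_3 - 1)\lambda_i(W)\right)^2}. \tag{46}$$

Relatively straightforward algebraic manipulation of (44) and (46) leads to the following conclusion: if
$\lambda_i(W) \in [-1, 1]$, $\theta_2 \ge 0$ and $\theta_3 \ge 1$, then $0 \le \alpha_i^* \le -\theta_1^{-1} \le \alpha_i^{**}$. This implies that (45) holds in the
region $[\alpha_i^*, -\theta_1^{-1}]$.

On the interval $\alpha \in [0, \alpha_i^*)$, the expression under the square root in (38) is positive, and the
corresponding eigenvalues are real. Thus,

$$\max\left(|\lambda_i^*(\Phi_3[\alpha])|, |\lambda_i^{**}(\Phi_3[\alpha])|\right) = \frac{1}{2}\begin{cases} \left|\lambda_i(W_3[\alpha]) + \sqrt{\lambda_i(W_3[\alpha])^2 + 4\alpha\theta_1}\right| & \text{if } \lambda_i(W_3[\alpha]) \ge 0, \\ \left|-\lambda_i(W_3[\alpha]) + \sqrt{\lambda_i(W_3[\alpha])^2 + 4\alpha\theta_1}\right| & \text{if } \lambda_i(W_3[\alpha]) < 0, \end{cases} \tag{47}$$

or equivalently, $\max(|\lambda_i^*(\Phi_3[\alpha])|, |\lambda_i^{**}(\Phi_3[\alpha])|) = \frac{1}{2}\left(|\lambda_i(W_3[\alpha])| + \sqrt{\lambda_i(W_3[\alpha])^2 + 4\alpha\theta_1}\right)$. These
results establish the expression for $J_i[\alpha, \lambda_i(W)]$ in the lemma.

It remains to establish that $J_1[\alpha, \lambda_1(W)]$ is less than all other $J_i[\alpha, \lambda_i(W)]$ in the region $\alpha \in [0, -\theta_1^{-1}]$.
In the region $\alpha \in [\alpha_i^*, -\theta_1^{-1}]$, we have $-\alpha\theta_1 \le 1$, implying that $\alpha^{1/2}(-\theta_1)^{1/2} \ge -\alpha\theta_1 = J_1[\alpha, \lambda_1(W)]$.
In the region $\alpha \in [0, \alpha_i^*)$, note that $\lambda_i(W_3[\alpha])^2 + 4\alpha\theta_1 \ge 0 \Rightarrow |\lambda_i(W_3[\alpha])| \ge 2(-\alpha\theta_1)^{1/2} \ge 0$,
which implies that

$$\frac{1}{2}\left(|\lambda_i(W_3[\alpha])| + \sqrt{\lambda_i(W_3[\alpha])^2 + 4\alpha\theta_1}\right) \ge \frac{1}{2}\left(2(-\alpha\theta_1)^{1/2} + \sqrt{\lambda_i(W_3[\alpha])^2 + 4\alpha\theta_1}\right) \ge (-\alpha\theta_1)^{1/2} \ge -\alpha\theta_1 = J_1[\alpha, \lambda_1(W)], \tag{48}$$

thereby establishing the final claim of the lemma.

The previous lemma indicates that we can remove $J_1[\alpha, \lambda_1(W)]$ from (41), leading to a simpler
optimization problem, $\alpha^* = \arg\min_{\alpha \ge 0} \max_{i = 2, 3, \ldots, N} J_i[\alpha, \lambda_i(W)]$. The following lemma establishes that we
can simplify the optimization even further and focus solely on $J_2[\alpha, \lambda_2(W)]$.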
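
The piecewise form (43) and the threshold (44) from Lemma 1 can be checked numerically for a single eigenvalue. The sketch below is illustrative only: the function names and the test value $\lambda_i(W) = 0.9$ are hypothetical, and the branch of (43) is selected by the sign of the discriminant, which is equivalent to the interval test on $[0, -\theta_1^{-1}]$ and avoids taking the square root of a slightly negative rounding error.

```python
import math


def w3_eigenvalue(lam, alpha, theta):
    _, t2, t3 = theta
    return (1.0 + alpha * (t3 - 1.0)) * lam + alpha * t2


def discriminant(lam, alpha, theta):
    """Expression under the square root in (38)."""
    return w3_eigenvalue(lam, alpha, theta) ** 2 + 4.0 * alpha * theta[0]


def alpha_star(lam, theta):
    """Lower boundary (44) of the region where the eigenvalue pair is complex."""
    t1, t2, t3 = theta
    c = t2 + (t3 - 1.0) * lam  # assumed non-zero
    root = math.sqrt(t1 * t1 + t1 * lam * c)
    return (-(lam * c + 2.0 * t1) - 2.0 * root) / (c * c)


def j_value(alpha, lam, theta):
    """J_i from (43) for alpha in [0, -1/theta1]."""
    disc = discriminant(lam, alpha, theta)
    if disc <= 0.0:
        return math.sqrt(-alpha * theta[0])
    return 0.5 * (abs(w3_eigenvalue(lam, alpha, theta)) + math.sqrt(disc))


theta, lam = (-0.5, 0.0, 1.5), 0.9
a_star = alpha_star(lam, theta)
j_at_zero = j_value(0.0, lam, theta)     # memoryless value, equal to |lam|
j_at_star = j_value(a_star, lam, theta)  # value at the threshold
```

At $\alpha = 0$ the function returns $|\lambda_i(W)|$, the memoryless rate, and at $\alpha_i^*$ the discriminant vanishes so the two branches of (43) meet.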

Lemma 2. Under the assumptions of Theorem 1, $J_i[\alpha, \lambda_i(W)] \le J_2[\alpha, \lambda_2(W)]$ and $\alpha_i^*[\lambda_i(W)] \le \alpha_2^*[\lambda_2(W)]$
for $i = 3, 4, \ldots, N$ over the range $\alpha \in [0, -\theta_1^{-1}]$.



Proof: Consider the derivative of a*[Aj(W)] in the range Aj(W) G [0, 1]: 

[40x (0 3 -l) "02 (02 + (03 -l)Ai(W))] 



-^_a*[A,(W)] = - ■ - L_ 



9Af(w) * L (e 2 + (e 3 -i)Xi(W)) 

9 l (-91 + 40x (0 3 - 1) + 2 (0 3 - 1) Ai(W) + 2 (0 3 - l) 2 A,(W) 2 



+ 



^01 (0i + Ai(W) (0 2 + (0 3 - 1) Ai(W))) 
It is clear that the multiplier outside the square brackets in the first line above is positive in the range 
Aj(W) G [0,1]. Furthermore, the first summand is negative. Under the conditions 02 > 0, 03 > 1, it 
can be established that the second summand is positive and exceeds the first summand in magnitude 
(see |[T9l for a complete derivation). We conclude that the derivative is positive, and thus a*[Aj(W)] is 
an increasing function over Aj(W) G [0, 1]. This implies that a* [Aj(W)] < a|[A2(W)] for any \ > 0. 

Algebraic manipulation of (011) leads to the conclusion that a*[— Aj(W)] < a*[A;(W)] for Aj(W) G 
[0, 1]. This implies that for positive A;, we have a* [-Aj(W)] < a*[A;(W)] < a%[\ 2 (W)]. We have thus 
shown that a*[Aj(W)] < a^[A 2 (W)] for any 3 < i < N under the assumption |Ajv(W)| < A 2 (W). 

Next we turn to proving that Ji[a, Aj(W)] < J 2 [a, A2CW)] for any 3 < i < N. We consider 
this problem on three distinct intervals: a G [0, a* [Aj(W)]), a G [a* [Aj (W)] , a%[\ 2 (W)]) and 
a G [a2[A 2 (W)] ) -0^ 1 ]. From the condition a*[Ai(W)] < a^[A 2 (W)] and <@3]> it is clear that on 
the interval a G [a 2 [X 2 (W)] } -O^ 1 ] we have Ji[a,Xi{W)] = J 2 [a, A 2 (W)] = a 1 / 2 (-0i) 1 / 2 . On the 
interval a G [a* [Ai(W)], a* 2 [\ 2 (W)}) we have Ji[a,Xi(W)] = a 1 / 2 (-0 i ) 1 / 2 and J 2 [a,\ 2 (W)} = 
\ (|A;(W 3 [a])| + VA i (W 3 [a]) 2 + 4a0 1 ). From ggj, we see that Ji[a, A;(W)] < J 2 [a, A 2 (W)]. 
On the first interval a G [0, a* [Aj(W)]), we examine the derivative of Ji[a, Aj(W)] w.r.t. Aj(W): 

d ww ., l + a(9 3 -l) ( A 4 (W) + a(0 2 + (0 3 -l)A J (W)) 

-Ji[a, Ai(W)J - 



1 -4a (0 2 + 3 - 1) + (Ai(W) + a (02 + (03 - 1) A,(W))) 2 

+ sgn [Ai(W) + a (0 2 + (0 3 - 1) A;(W))] ^ (49) 

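We observe that the multiplier $\frac{1+\alpha(\theta_3-1)}{2}$ is positive, and the expression under the square root is positive
because $\alpha \in [0, \alpha_i^*[\lambda_i(\mathbf{W})])$. Additionally, $\lambda_i(\mathbf{W}) + \alpha\left(\theta_2 + (\theta_3-1)\lambda_i(\mathbf{W})\right) \ge 0$ under the assumption
$\lambda_i(\mathbf{W}) \ge 0$ and $\theta_2 \ge 0$, $\theta_3 \ge 1$. Thus $\frac{\partial}{\partial \lambda_i(\mathbf{W})}\mathcal{J}_i[\alpha, \lambda_i(\mathbf{W})] \ge 0$ for any $\lambda_i(\mathbf{W}) \ge 0$ and we
have $\mathcal{J}_i[\alpha, \lambda_i(\mathbf{W})] \le \mathcal{J}_2[\alpha, \lambda_2(\mathbf{W})]$ for any $0 \le \lambda_i(\mathbf{W}) \le \lambda_2(\mathbf{W})$. Finally, we note from the definition of $\mathcal{J}_i$ that
$\mathcal{J}_i[\alpha, \lambda_i(\mathbf{W})]$ is an increasing function of $|\lambda_i(\mathbf{W}_3[\alpha])| = |(1+\alpha(\theta_3-1))\lambda_i(\mathbf{W}) + \alpha\theta_2|$. Thus, to
show that $\mathcal{J}_i[\alpha, -\lambda_i(\mathbf{W})] \le \mathcal{J}_i[\alpha, \lambda_i(\mathbf{W})]$ for $0 \le \lambda_i(\mathbf{W}) \le \lambda_2(\mathbf{W})$ it is sufficient to show that
$|-(1+\alpha(\theta_3-1))\lambda_i(\mathbf{W}) + \alpha\theta_2| \le |(1+\alpha(\theta_3-1))\lambda_i(\mathbf{W}) + \alpha\theta_2|$. Under our assumptions, we have

$$|(1+\alpha(\theta_3-1))\lambda_i(\mathbf{W}) + \alpha\theta_2|^2 - |-(1+\alpha(\theta_3-1))\lambda_i(\mathbf{W}) + \alpha\theta_2|^2 = 4(1+\alpha(\theta_3-1))\lambda_i(\mathbf{W})\,\alpha\theta_2 \ge 0. \qquad (50)$$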

This implies that $\mathcal{J}_i[\alpha, \lambda_i(\mathbf{W})] \le \mathcal{J}_2[\alpha, \lambda_2(\mathbf{W})]$ on the interval $\alpha \in [0, \alpha_i^*[\lambda_i(\mathbf{W})])$, indicating that the
condition applies on the entire interval $\alpha \in [0, -\theta_1^{-1}]$, which is what we wanted to show. ■

The remainder of the proof of Theorem 1 proceeds as follows. From Lemmas 1 and 2, the optimization
problem simplifies to $\alpha^* = \arg\min_{\alpha \ge 0} \mathcal{J}_2[\alpha, \lambda_2(\mathbf{W})]$. We shall now show that $\alpha_2^*$ is a global minimizer
of this function. Consider the derivative of $\mathcal{J}_2[\alpha, \lambda_2(\mathbf{W})]$ w.r.t. $\alpha$ on $[0, \alpha_2^*)$:

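$$\frac{\partial}{\partial \alpha}\mathcal{J}_2[\alpha, \lambda_2(\mathbf{W})] = \frac{2\theta_1 + \left(\theta_2 + (\theta_3-1)\lambda_2(\mathbf{W})\right)\left(\lambda_2(\mathbf{W}) + \alpha\left(\theta_2 + (\theta_3-1)\lambda_2(\mathbf{W})\right)\right)}{2\sqrt{4\alpha\theta_1 + \left(\lambda_2(\mathbf{W}) + \alpha\left(\theta_2 + (\theta_3-1)\lambda_2(\mathbf{W})\right)\right)^2}} + \frac{1}{2}\left(\theta_2 + (\theta_3-1)\lambda_2(\mathbf{W})\right)\operatorname{sgn}\left[\lambda_2(\mathbf{W}) + \alpha\left(\theta_2 + (\theta_3-1)\lambda_2(\mathbf{W})\right)\right].$$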

Denote the first term in this sum by $\varphi_1(\lambda_2(\mathbf{W}), \alpha)$ and the second by $\varphi_2(\lambda_2(\mathbf{W}), \alpha)$. It can be shown
that $|\varphi_1(\lambda_2(\mathbf{W}), \alpha)| \ge |\varphi_2(\lambda_2(\mathbf{W}), \alpha)|$ for any $\lambda_2(\mathbf{W}) \in [-1,1]$ and $\alpha \in [0, \alpha_2^*)$ by directly solving the
inequality. We conclude that the sign of the derivative on $\alpha \in [0, \alpha_2^*)$ is completely determined by the
sign of $\varphi_1(\lambda_2(\mathbf{W}), \alpha)$ for $\lambda_2(\mathbf{W}) \in [-1,1]$. On $\alpha \in [0, \alpha_2^*)$ the sign of $\varphi_1(\lambda_2(\mathbf{W}), \alpha)$ is determined by
the sign of its numerator. The transition point for the numerator's sign occurs at:

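$$\alpha^{+} = -\frac{2\theta_1 + \lambda_2(\mathbf{W})\left(\theta_2 + (\theta_3-1)\lambda_2(\mathbf{W})\right)}{\left(\theta_2 + (\theta_3-1)\lambda_2(\mathbf{W})\right)^2}$$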

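and by showing that $\alpha^{+} \ge -\theta_1^{-1}$, we can establish that this transition point is at or beyond $\alpha_2^*$. This
indicates that $\varphi_1(\lambda_2(\mathbf{W}), \alpha) \le 0$ if $\alpha \in [0, \alpha_2^*)$. We observe that $\mathcal{J}_2[\alpha, \lambda_2(\mathbf{W})]$ is nonincreasing on
$\alpha \in [0, \alpha_2^*)$ and nondecreasing on $\alpha \in [\alpha_2^*, -\theta_1^{-1})$ (as established in Lemma 1). We conclude that
$\alpha_2^*$ is a global minimizer of the function $\mathcal{J}_2[\alpha, \lambda_2(\mathbf{W})]$, thereby proving Theorem 1 and establishing
$\mathcal{J}_2[\alpha^*, \lambda_2(\mathbf{W})] = |\lambda_2^*(\Phi_3[\alpha^*])| = \sqrt{-\alpha^*\theta_1}$.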

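Note that the last argument also implies that $\mathcal{J}_2[\alpha, \lambda_2(\mathbf{W})] \le \lambda_2(\mathbf{W})$ on $\alpha \in [0, \alpha_2^*]$ and $\mathcal{J}_2[\alpha, \lambda_2(\mathbf{W})] <
1$ on $\alpha \in (\alpha_2^*, -\theta_1^{-1})$, since $\mathcal{J}_2[\alpha, \lambda_2(\mathbf{W})]$ is non-increasing on the former interval, non-decreasing on
the latter interval, and $\mathcal{J}_2[-\theta_1^{-1}, \lambda_2(\mathbf{W})] = 1$. This fact demonstrates that the matrix $\Phi_3[\alpha]$ is convergent
if $\alpha \in [0, -\theta_1^{-1})$ in the sense that we have $\rho(\Phi_3[\alpha] - \mathbf{J}) < 1$.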
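
As a hedged illustration of the minimization argument, the Python sketch below evaluates $\mathcal{J}_2[\alpha, \lambda_2(\mathbf{W})]$ on a grid of $\alpha \in [0, -\theta_1^{-1})$ and compares the grid minimizer with the closed-form $\alpha^*$. The piecewise form of $\mathcal{J}_2$ follows the expressions used in the proof; the values $\theta_2 = 1$, $\theta_3 = 1.5$, $\lambda_2(\mathbf{W}) = 0.9$, the grid step, and all function names are assumptions made only for this example.

```python
import math

# Illustrative (assumed) parameters; not values from the paper.
THETA2, THETA3 = 1.0, 1.5
THETA1 = 1.0 - THETA2 - THETA3  # negative


def alpha_star(lam):
    """Smaller root in alpha of (lam + alpha*c)^2 + 4*alpha*theta1 = 0."""
    c = THETA2 + (THETA3 - 1.0) * lam
    root = math.sqrt(max(0.0, THETA1 * (THETA1 + lam * c)))
    return (-(lam * c + 2.0 * THETA1) - 2.0 * root) / (c * c)


def J(alpha, lam):
    """Magnitude of the dominant accelerated eigenvalue paired with lam."""
    x = (1.0 + alpha * (THETA3 - 1.0)) * lam + alpha * THETA2
    disc = x * x + 4.0 * alpha * THETA1
    if disc < 0.0:
        # Complex-conjugate pair: magnitude is sqrt(-alpha * theta1).
        return math.sqrt(-alpha * THETA1)
    return 0.5 * (abs(x) + math.sqrt(disc))


lam2 = 0.9                      # assumed second-largest eigenvalue of W
a_opt = alpha_star(lam2)        # closed-form minimizer
a_max = -1.0 / THETA1           # right end of the admissible interval
step = 1e-4
grid = [k * step for k in range(int(a_max / step))]
a_grid = min(grid, key=lambda a: J(a, lam2))  # brute-force minimizer

print(a_opt, a_grid, J(a_opt, lam2), math.sqrt(-a_opt * THETA1))
```

Under these assumptions the brute-force minimizer lands within one grid step of $\alpha^*$, and $\mathcal{J}_2$ stays below 1 over the whole grid, consistent with the convergence remark above.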
C. Proof of Theorem 2: Convergence Rate

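Proof: According to the discussion in Sections III-A and V-B, we have

$$\rho(\Phi_3[\alpha^*] - \mathbf{J}) = |\lambda_2^*(\Phi_3[\alpha^*])| = (\alpha^*|\theta_1|)^{1/2} = \left(|\theta_1|\,\frac{-\left((\theta_3-1)\lambda_2^2 + \theta_2\lambda_2 + 2\theta_1\right) - 2\sqrt{\theta_1^2 + \theta_1\lambda_2\left(\theta_2 + (\theta_3-1)\lambda_2\right)}}{\left(\theta_2 + (\theta_3-1)\lambda_2\right)^2}\right)^{1/2}.$$

In order to prove the claim, we consider two cases: $\lambda_2(\mathbf{W}) = 1 - \Psi(N)$, and $\lambda_2(\mathbf{W}) < 1 - \Psi(N)$.

First, we suppose that $\lambda_2(\mathbf{W}) = 1 - \Psi(N)$ and show that $\rho(\Phi_3[\alpha^*] - \mathbf{J})^2 - (1 - \sqrt{\Psi(N)})^2 \le 0$.
Denoting $\Psi(N) = \delta$ and substituting $\lambda_2(\mathbf{W}) = 1 - \delta$ and $\theta_1 = 1 - \theta_2 - \theta_3$, we obtain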

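$$\rho(\Phi_3[\alpha^*] - \mathbf{J})^2 - \left(1 - \sqrt{\Psi(N)}\right)^2 = -\left(1-\sqrt{\delta}\right)^2 \times \frac{(\theta_3-1)(\delta^2-\delta) + 2\sqrt{\delta}\,(\theta_3+\theta_2-1) - 2\sqrt{\delta\left(\theta_2 + (2-\delta)(\theta_3-1)\right)(\theta_3+\theta_2-1)}}{\left[(2-\delta)\delta + 1\right](1-\theta_3) - (1+\delta)\theta_2 - 2\sqrt{\delta\,(\theta_3+\theta_2-1)\left((\theta_3-1)(2-\delta) + \theta_2\right)}}$$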
It is clear from the assumptions that the expressions under square roots are non-negative. Furthermore, the
denominator is negative since $1 - \theta_3 \le 0$, $\theta_2 \ge 0$ and $\delta \in (0,1)$. Finally, note that $(\theta_3-1)(\delta^2-\delta) \le 0$
and $2\sqrt{\delta}\,(\theta_3+\theta_2-1) \ge 0$. Thus, to see that the numerator is non-positive, observe that

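$$\left[\sqrt{\delta}\,(\theta_3+\theta_2-1)\right]^2 - \left[\sqrt{\delta\left(\theta_2 + (2-\delta)(\theta_3-1)\right)(\theta_3+\theta_2-1)}\right]^2 = (\delta-1)\,\delta\,(\theta_3-1)(\theta_3+\theta_2-1) \le 0. \qquad (51)$$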



Thus, we have $\rho(\Phi_3[\alpha^*] - \mathbf{J})^2 - (1 - \sqrt{\Psi(N)})^2 \le 0$, implying that $\rho(\Phi_3[\alpha^*] - \mathbf{J}) \le 1 - \sqrt{\Psi(N)}$ if
$\lambda_2(\mathbf{W}) = 1 - \Psi(N)$.

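Now suppose $\lambda_2(\mathbf{W}) < 1 - \Psi(N)$. We have seen in Lemma 2 that $\alpha_i^*[\lambda_i(\mathbf{W})]$ is an increasing
function of $\lambda_i(\mathbf{W})$, implying $\alpha_2^*[\lambda_2(\mathbf{W})] \le \alpha_2^*[1 - \Psi(N)]$. Since $\rho(\Phi_3[\alpha^*] - \mathbf{J}) = (\alpha^*|\theta_1|)^{1/2} =
(\alpha_2^*[\lambda_2(\mathbf{W})]\,|\theta_1|)^{1/2}$ is an increasing function of $\alpha_2^*[\lambda_2(\mathbf{W})]$, the claim of the theorem follows. ■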
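
The bound of Theorem 2 can be spot-checked numerically. The Python sketch below is a minimal check under stated assumptions: it evaluates $\rho(\Phi_3[\alpha^*] - \mathbf{J}) = (\alpha^*|\theta_1|)^{1/2}$ from the closed form above at $\lambda_2(\mathbf{W}) = 1 - \delta$ and compares it with $1 - \sqrt{\delta}$. The $(\theta_2, \theta_3)$ pairs and the grid of $\delta$ values are arbitrary illustrative choices, and `rho_accel` is a hypothetical helper name.

```python
import math


def rho_accel(lam2, theta2, theta3):
    """(alpha* |theta1|)^(1/2) evaluated from the closed-form alpha*."""
    theta1 = 1.0 - theta2 - theta3
    c = theta2 + (theta3 - 1.0) * lam2
    # Guard against a tiny negative argument caused by rounding.
    root = math.sqrt(max(0.0, theta1 * (theta1 + lam2 * c)))
    alpha = (-(lam2 * c + 2.0 * theta1) - 2.0 * root) / (c * c)
    return math.sqrt(alpha * abs(theta1))


# Assumed test points with theta2 >= 0, theta3 >= 1 (not from the paper).
pairs = [(0.5, 1.5), (1.0, 1.5), (0.0, 2.0), (1.0, 1.0)]
deltas = [k / 100.0 for k in range(1, 100)]

# Case 1: lambda_2(W) = 1 - delta. A positive gap would violate the bound.
worst_gap = max(
    rho_accel(1.0 - d, t2, t3) - (1.0 - math.sqrt(d))
    for (t2, t3) in pairs
    for d in deltas
)

# Case 2: lambda_2(W) < 1 - delta, here with delta = 0.1.
delta = 0.1
case2_ok = all(
    rho_accel(lam, t2, t3) <= 1.0 - math.sqrt(delta) + 1e-9
    for (t2, t3) in pairs
    for lam in [k / 10.0 for k in range(1, 9)]
)

print(worst_gap, case2_ok)
```

For the pair $\theta_2 = 1$, $\theta_3 = 1$ the factor $(\theta_3 - 1)$ vanishes in (51), so the bound should be met with equality up to rounding.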



D. Proof of Theorem 3: Expected Gain

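Proof: First, condition on a particular realization of the graph topology, and observe from the
definition of $\tau_{\mathrm{asym}}(\cdot)$ that

$$\frac{\tau_{\mathrm{asym}}(\mathbf{W})}{\tau_{\mathrm{asym}}(\Phi_3[\alpha^*])} = \frac{\log \rho(\Phi_3[\alpha^*] - \mathbf{J})}{\log \rho(\mathbf{W} - \mathbf{J})}. \qquad (52)$$

Next, fixing $\rho(\mathbf{W} - \mathbf{J}) = 1 - \psi$, where $\Psi(N) = E\{\psi\}$, and using Theorem 2, we have

$$\frac{\tau_{\mathrm{asym}}(\mathbf{W})}{\tau_{\mathrm{asym}}(\Phi_3[\alpha^*])} \ge \frac{\log(1 - \sqrt{\psi})}{\log(1 - \psi)}. \qquad (53)$$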
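Let $f(x) = \log(1 - \sqrt{x})/\log(1 - x)$. Taking the Taylor series expansion of $f(x)$ at $x = 0$, we obtain

$$f(x) = \frac{1}{\sqrt{x}} + \frac{1}{2} - \frac{1}{6}x^{1/2} - \frac{1}{20}x^{3/2} - \cdots \qquad (54)$$

Noting that $x \ge 0$ we conclude that the following holds uniformly over $x \in [0,1]$: $f(x) \le \frac{1}{\sqrt{x}} + \frac{1}{2}$. At
the same time, taking the Taylor series expansions of the numerator and denominator of $f(x)$, we obtain

$$f(x) = \frac{\sqrt{x} + \frac{x}{2} + \frac{x^{3/2}}{3} + \frac{x^{2}}{4} + \frac{x^{5/2}}{5} + \cdots}{x + \frac{x^{2}}{2} + \frac{x^{3}}{3} + \cdots}. \qquad (55)$$

Noting that $1/6 + 1/3 = 1/2$, $2/15 + 1/5 = 1/3$, we can express this as

$$f(x) = \frac{1}{\sqrt{x}} \cdot \frac{\sqrt{x} + \frac{1}{2}x + \frac{1}{3}x^{3/2} + \frac{1}{4}x^{2} + \frac{1}{5}x^{5/2} + \cdots}{\sqrt{x} + \frac{1}{6}x^{3/2} + \frac{1}{3}x^{3/2} + \frac{2}{15}x^{5/2} + \frac{1}{5}x^{5/2} + \cdots} \qquad (56)$$

and using the fact that $\frac{1}{2}x \ge \frac{1}{6}x^{3/2}$, $\frac{1}{4}x^{2} \ge \frac{2}{15}x^{5/2}$, $\ldots$ uniformly over $x \in [0,1]$, we conclude
that $f(x) \ge \frac{1}{\sqrt{x}}$. Thus, $\frac{1}{\sqrt{x}} \le f(x) \le \frac{1}{\sqrt{x}} + \frac{1}{2}$, where both bounds are tight. Finally, observe that
$\frac{\partial^2}{\partial x^2}x^{-1/2} = \frac{3}{4}x^{-5/2} > 0$ if $x > 0$, implying that $1/\sqrt{x}$ is convex. To complete the proof we take the
expectation with respect to graph realizations and apply Jensen's inequality to obtain

$$E\left\{\frac{\tau_{\mathrm{asym}}(\mathbf{W})}{\tau_{\mathrm{asym}}(\Phi_3[\alpha^*])}\right\} \ge E\left\{\frac{1}{\sqrt{\psi}}\right\} \ge \frac{1}{\sqrt{\Psi(N)}}. \qquad (57)$$

■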
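
The two ingredients of this proof, the sandwich bound on $f(x)$ and the Jensen step, can be illustrated with a short Python sketch. It is a numerical check rather than part of the argument: the grid of $x$ values and the uniform distribution used to draw samples of $\psi$ are assumptions chosen only for illustration.

```python
import math
import random


def f(x):
    """f(x) = log(1 - sqrt(x)) / log(1 - x) for 0 < x < 1."""
    return math.log1p(-math.sqrt(x)) / math.log1p(-x)


# Check 1/sqrt(x) <= f(x) <= 1/sqrt(x) + 1/2 on a grid inside (0, 1).
xs = [k / 1000.0 for k in range(1, 1000)]
lower_ok = all(f(x) >= 1.0 / math.sqrt(x) - 1e-9 for x in xs)
upper_ok = all(f(x) <= 1.0 / math.sqrt(x) + 0.5 + 1e-9 for x in xs)

# Jensen step: E{1/sqrt(psi)} >= 1/sqrt(E{psi}) since 1/sqrt(x) is convex.
# The distribution of psi below is a hypothetical stand-in for the
# randomness induced by graph realizations.
rng = random.Random(0)
psis = [rng.uniform(0.01, 0.2) for _ in range(10000)]
mean_inv_sqrt = sum(1.0 / math.sqrt(p) for p in psis) / len(psis)
inv_sqrt_mean = 1.0 / math.sqrt(sum(psis) / len(psis))

print(lower_ok, upper_ok, mean_inv_sqrt >= inv_sqrt_mean)
```

The Jensen inequality holds for any sample of positive $\psi$ values, so the last comparison does not depend on the particular distribution assumed here.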

VI. Concluding Remarks 

This paper provides theoretical performance guarantees for accelerated distributed averaging algorithms
using node memory. We consider acceleration based on local linear prediction and focus on the setting
where each node uses two memory taps. We derive the optimal value of the mixing parameter for
the accelerated averaging algorithm and discuss a fully-decentralized scheme for estimating the spectral
radius, which is then used to initialize the optimal mixing parameter. An important contribution of this
paper is the derivation of an upper bound on the spectral radius of the accelerated consensus matrix. This
bound relates the growth rate of the spectral radius of the original matrix to that of the accelerated consensus
matrix. We believe that this result applies to the general class of distributed averaging algorithms using
node state prediction, and shows that, even in its simplified form and even at the theoretical level,
accelerated consensus may provide considerable processing gain. We conclude that this gain, measured
as the ratio of the asymptotic averaging times of the non-accelerated and accelerated algorithms, grows
with increasing network size. Numerical experiments confirm our theoretical conclusions and reveal the
feasibility of online implementation of the accelerated algorithm with nearly optimal properties. Finding
ways to analyze the proposed algorithm in more general instantiations and proposing simpler initialization
schemes are the focus of ongoing investigation.